Timetk Functions

R topics documented:
timetk-package
anomalize
between_time
bike_sharing_daily
box_cox_vec
condense_period
diff_vec
FANG
filter_by_time
filter_period
fourier_vec
future_frame
is_date_class
lag_vec
log_interval_vec
m4_daily
m4_hourly
m4_monthly
m4_quarterly
m4_weekly
m4_yearly
mutate_by_time
normalize_vec
pad_by_time
parse_date2
plot_acf_diagnostics
plot_anomalies
plot_anomaly_diagnostics
plot_seasonal_diagnostics
plot_stl_diagnostics
plot_time_series
plot_time_series_boxplot
plot_time_series_cv_plan
plot_time_series_regression
set_tk_time_scale_template
slice_period
slidify
slidify_vec
smooth_vec
standardize_vec
step_box_cox
step_diff
step_fourier
step_holiday_signature
step_log_interval
step_slidify
step_slidify_augment
step_smooth
step_timeseries_signature
step_ts_clean
step_ts_impute
step_ts_pad
summarise_by_time
taylor_30_min
time_arithmetic
time_series_cv
time_series_split
tk_acf_diagnostics
tk_anomaly_diagnostics
tk_augment_differences
tk_augment_fourier
tk_augment_holiday
tk_augment_lags
tk_augment_slidify
tk_augment_timeseries
tk_get_frequency
tk_get_holiday
tk_get_timeseries
tk_get_timeseries_unit_frequency
tk_get_timeseries_variables
tk_index
tk_make_future_timeseries
tk_make_holiday_sequence
tk_make_timeseries
tk_seasonal_diagnostics
tk_stl_diagnostics
tk_summary_diagnostics
tk_tbl
tk_time_series_cv_plan
tk_ts
tk_tsfeatures
tk_xts
tk_zoo
tk_zooreg
ts_clean_vec
ts_impute_vec
walmart_sales_weekly
wikipedia_traffic_daily
Description
The timetk package combines a collection of coercion tools for time series analysis.
Details
Author(s)
See Also
Useful links:
• https://github.com/business-science/timetk
• https://business-science.github.io/timetk/
• Report bugs at https://github.com/business-science/timetk/issues
Description
anomalize() is used to detect anomalies in time series data, either for a single time series or for
multiple time series grouped by a specific column.
Usage
anomalize(
.data,
.date_var,
.value,
.frequency = "auto",
.trend = "auto",
.method = "stl",
.iqr_alpha = 0.05,
.clean_alpha = 0.75,
.max_anomalies = 0.2,
.message = TRUE
)
Arguments
.data A tibble or data.frame with a time-based column
.date_var A column containing either date or date-time values
.value A column containing numeric values
.frequency Controls the seasonal adjustment (removal of seasonality). Input can be either "auto", a time-based definition (e.g. "2 weeks"), or a numeric number of observations per frequency (e.g. 10). Refer to tk_get_frequency().
.trend Controls the trend component. For STL, trend controls the sensitivity of the
LOESS smoother, which is used to remove the remainder. Refer to tk_get_trend().
.method The outlier detection method. Default: "stl". Currently "stl" is the only method.
"twitter" is planned.
.iqr_alpha Controls the width of the "normal" range. Lower values are more conservative while higher values are less prone to incorrectly classifying "normal" observations.
.clean_alpha Controls the threshold for cleaning the outliers. The default is 0.75, which means
that the anomalies will be cleaned using the 0.75 * lower or upper bound of the
recomposed time series, depending on the direction of the anomaly.
.max_anomalies The maximum percent of anomalies permitted to be identified.
.message A boolean. If TRUE, will output information related to automatic frequency and
trend selection (if applicable).
Details
anomalize() implements a 2-step process to detect outliers in time series.
Step 1: Detrend & Remove Seasonality using STL Decomposition
The decomposition separates the "season" and "trend" components from the "observed" values, leaving the "remainder" for anomaly detection.
The user can control two parameters: frequency and trend.
1. .frequency: Adjusts the "season" component that is removed from the "observed" values.
2. .trend: Adjusts the trend window (the t.window parameter from stats::stl()) that is used.
The user may supply both .frequency and .trend as time-based durations (e.g. "6 weeks") or
numeric values (e.g. 180) or "auto", which predetermines the frequency and/or trend based on the
scale of the time series using the tk_time_scale_template().
Step 2: Anomaly Detection
Once the "trend" and "season" (seasonality) components are removed, anomaly detection is performed on the "remainder". Anomalies are identified, and boundaries (recomposed_l1 and recomposed_l2) are determined.
The anomaly detection method uses an interquartile range (IQR) of +/-25% around the median.
IQR Adjustment, alpha parameter
With the default alpha = 0.05, the limits are established by expanding the 25/75 baseline by an
IQR Factor of 3 (3X). The IQR Factor = 0.15 / alpha (hence 3X with alpha = 0.05):
• To increase the IQR Factor controlling the limits, decrease the alpha, which makes it more
difficult to be an outlier.
• Increase alpha to make it easier to be an outlier.
• The IQR outlier detection method is used in forecast::tsoutliers().
• A similar outlier detection method is used by Twitter’s AnomalyDetection package.
• Both Twitter and Forecast tsoutliers methods have been implemented in Business Science’s
anomalize package.
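The band arithmetic described above can be sketched in base R. This is an illustration of the method, not timetk's internal code, and the helper name iqr_limits is hypothetical:

```r
# Illustration of the IQR band logic described above (not timetk internals)
iqr_limits <- function(remainder, alpha = 0.05) {
  q <- stats::quantile(remainder, probs = c(0.25, 0.75), names = FALSE)
  iqr_factor <- 0.15 / alpha               # 3X when alpha = 0.05
  band <- iqr_factor * (q[2] - q[1])       # expand the 25/75 baseline
  c(lower = q[1] - band, upper = q[2] + band)
}

set.seed(123)
remainder <- c(rnorm(100), 10)             # one obvious outlier at the end
limits <- iqr_limits(remainder)
which(remainder < limits["lower"] | remainder > limits["upper"])
```

Decreasing alpha widens the band (larger IQR Factor), so fewer points are flagged, which matches the bullet points above.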
Value
A tibble or data.frame with the following columns:
• observed: original data
• seasonal: seasonal component
• seasadj: seasonally adjusted
• trend: trend component
• remainder: residual component
• anomaly: Yes/No flag for outlier detection
• anomaly score: distance from centerline
• anomaly direction: -1, 0, 1 indicator for the direction of the anomaly
• recomposed_l1: lower level bound of recomposed time series
• recomposed_l2: upper level bound of recomposed time series
• observed_clean: original data with anomalies interpolated
References
1. CLEVELAND, R. B., CLEVELAND, W. S., MCRAE, J. E., AND TERPENNING, I. STL: A
Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics, Vol.
6, No. 1 (1990), pp. 3-73.
2. Owen S. Vallis, Jordan Hochenbaum and Arun Kejariwal (2014). A Novel Technique for
Long-Term Anomaly Detection in the Cloud. Twitter Inc.
Examples
library(dplyr)
walmart_sales_weekly %>%
filter(id %in% c("1_1", "1_3")) %>%
group_by(id) %>%
anomalize(Date, Weekly_Sales)
between_time Between (For Time Series): Range detection for date or date-time sequences
Description
The easiest way to filter time series date or date-time vectors. Returns a logical vector indicating
which date or date-time values are within a range. See filter_by_time() for the data.frame
(tibble) implementation.
Usage
between_time(index, start_date = "start", end_date = "end")
Arguments
index A date or date-time vector.
start_date The starting date
end_date The ending date
Details
Pure Time Series Filtering Flexibility
The start_date and end_date parameters are designed with flexibility in mind.
Each side of the time_formula is specified as the character 'YYYY-MM-DD HH:MM:SS', but powerful shorthand is available. Some examples are:
• Year: start_date = '2013', end_date = '2015'
• Month: start_date = '2013-01', end_date = '2016-06'
• Day: start_date = '2013-01-05', end_date = '2016-06-04'
Internal Calculations
All shorthand dates are expanded:
• The start_date is expanded to be the first date in that period
• The end_date side is expanded to be the last date in that period
This means that the following examples are equivalent (assuming your index is a POSIXct):
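The expansion can be sketched with base R date handling (an illustration of the assumed behavior, not timetk's parser):

```r
# Sketch of how a month shorthand like "2016-02" expands (base R only)
start_full <- as.POSIXct("2016-02-01 00:00:00", tz = "UTC")  # first instant
# last instant of the month = first instant of the next month minus one second
end_full <- seq(start_full, by = "1 month", length.out = 2)[2] - 1
format(end_full, "%Y-%m-%d %H:%M:%S")
```

So "2016-02" as a start becomes "2016-02-01 00:00:00" and as an end becomes "2016-02-29 23:59:59" (2016 being a leap year).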
Value
A logical vector the same length as index indicating whether or not the timestamp value was
within the start_date and end_date range.
References
• This function is based on the tibbletime::filter_time() function developed by Davis
Vaughan.
See Also
Time-Based dplyr functions:
Examples
library(dplyr)
# How it works
# - Returns TRUE/FALSE length of index
# - Use sum() to tally the number of TRUE values
# Minute Series:
index_min <- tk_make_timeseries("2016-01-01", "2016-02-15", by = "min") # (defined for the example)
index_min[index_min %>% between_time("2016-02-01 12:00", "2016-02-01 13:00")]
Description
This dataset contains the daily count of rental bike transactions between 2011 and 2012 in the Capital Bikeshare system, with the corresponding weather and seasonal information.
Usage
bike_sharing_daily
Format
A tibble: 731 x 16
• instant: record index
• dteday : date
• season : season (1:winter, 2:spring, 3:summer, 4:fall)
• yr : year (0: 2011, 1:2012)
• mnth : month ( 1 to 12)
• hr : hour (0 to 23)
• holiday : whether day is holiday or not
• weekday : day of the week
• workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
• weathersit :
– 1: Clear, Few clouds, Partly cloudy, Partly cloudy
References
Fanaee-T, Hadi, and Gama, Joao, ’Event labeling combining ensemble detectors and background
knowledge’, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg.
Examples
bike_sharing_daily
Description
This is mainly a wrapper for the BoxCox transformation from the forecast R package. The box_cox_vec() function performs the transformation. box_cox_inv_vec() inverts the transformation. auto_lambda() helps in selecting the optimal lambda value.
Usage
box_cox_vec(x, lambda = "auto", silent = FALSE)
box_cox_inv_vec(x, lambda)
auto_lambda(
x,
method = c("guerrero", "loglik"),
lambda_lower = -1,
lambda_upper = 2
)
Arguments
x A numeric vector.
lambda The box cox transformation parameter. If set to "auto", performs automated
lambda selection using auto_lambda().
silent Whether or not to report the automated lambda selection as a message.
method The method used for automatic lambda selection. Either "guerrero" or "loglik".
lambda_lower A lower limit for automatic lambda selection
lambda_upper An upper limit for automatic lambda selection
Details
The Box Cox transformation is a power transformation that is commonly used to reduce variance
of a time series.
Automatic Lambda Selection
If desired, the lambda argument can be selected using auto_lambda(), a wrapper for the Forecast R Package’s forecast::BoxCox.lambda() function. Use either of 2 methods: "guerrero" or "loglik".
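The transformation itself can be sketched in base R. This is the standard one-parameter Box Cox formula (forecast::BoxCox handles additional cases, so treat this as a simplified sketch with hypothetical helper names):

```r
# Minimal Box Cox sketch: (x^lambda - 1) / lambda, with log(x) when lambda = 0
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}
box_cox_inv <- function(y, lambda) {
  if (abs(lambda) < 1e-8) exp(y) else (lambda * y + 1)^(1 / lambda)
}

x <- c(1, 10, 100, 1000)
round(box_cox_inv(box_cox(x, lambda = 0.3), lambda = 0.3), 6)  # recovers x
```

The round trip through box_cox() and box_cox_inv() recovers the original series, which is what box_cox_vec() and box_cox_inv_vec() provide.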
Value
Returns a numeric vector that has been transformed.
References
• Forecast R Package
• Forecasting: Principles & Practices: Transformations & Adjustments
• Guerrero, V.M. (1993) Time-series analysis supported by power transformations. Journal of
Forecasting, 12, 37–48.
See Also
• Box Cox Transformation: box_cox_vec()
• Lag Transformation: lag_vec()
• Differencing Transformation: diff_vec()
• Rolling Window Transformation: slidify_vec()
• Loess Smoothing Transformation: smooth_vec()
• Fourier Series: fourier_vec()
• Missing Value Imputation for Time Series: ts_impute_vec(), ts_clean_vec()
Other common transformations to reduce variance: log(), log1p() and sqrt()
Examples
library(dplyr)
d10_daily <- m4_daily %>% dplyr::filter(id == "D10")
m4_daily %>%
dplyr::group_by(id) %>%
dplyr::mutate(value_bc = box_cox_vec(value))
Description
Convert a data.frame object from daily to monthly, from minute data to hourly, and more. This
allows the user to easily aggregate data to a less granular level by taking the value from either the
beginning or end of the period.
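The idea can be sketched with base R (an illustration only; condense_period() uses lubridate period flooring and dplyr grammar):

```r
# Base R sketch of condensing daily data to monthly, keeping the first
# observation per period (roughly what .side = "start" does)
df <- data.frame(
  date  = seq(as.Date("2021-01-01"), as.Date("2021-03-31"), by = "day"),
  value = seq_len(90)
)
first_per_month <- df[!duplicated(format(df$date, "%Y-%m")), ]
first_per_month$date
```

With .side = "end" the last observation per period would be kept instead.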
Usage
condense_period(.data, .date_var, .period = "1 day", .side = c("start", "end"))
Arguments
.data A tbl object or data.frame
.date_var A column containing date or date-time values. If missing, attempts to auto-
detect date column.
.period A period to condense the time series to. Time units are condensed using lubridate::floor_date()
or lubridate::ceiling_date().
The value can be:
• second
• minute
• hour
• day
• week
• month
• bimonth
• quarter
• season
• halfyear
• year
Arbitrary unique English abbreviations as in the lubridate::period() con-
structor are allowed:
• "1 year"
• "2 months"
• "30 seconds"
.side One of "start" or "end". Determines if the first observation in the period should
be returned or the last.
Value
A tibble or data.frame
See Also
Time-Based dplyr functions:
• summarise_by_time() - Easily summarise using a date column.
• mutate_by_time() - Simplifies applying mutations by time windows.
• pad_by_time() - Insert time series rows with regularly spaced timestamps
• filter_by_time() - Quickly filter using date ranges.
• filter_period() - Apply filtering expressions inside periods (windows)
• slice_period() - Apply slice inside periods (windows)
• condense_period() - Convert to a different periodicity
• between_time() - Range detection for date or date-time sequences.
• slidify() - Turn any function into a sliding (rolling) function
Examples
# Libraries
library(dplyr)
Description
diff_vec() applies a Differencing Transformation. diff_inv_vec() inverts the differencing transformation.
Usage
diff_vec(
x,
lag = 1,
difference = 1,
log = FALSE,
initial_values = NULL,
silent = FALSE
)
Arguments
x A numeric vector to be differenced or inverted.
lag Which lag (how far back) to be included in the differencing calculation.
difference The number of differences to perform.
• 1 Difference is equivalent to measuring period change.
• 2 Differences is equivalent to measuring period acceleration.
log If log differences should be calculated. Note that difference inversion of a log-
difference is approximate.
initial_values Only used in the diff_inv_vec() operation. A numeric vector of the initial values, which are used to invert differences. This vector is the original values that are the length of the NA missing differences.
silent Whether or not to report the initial values used to invert the difference as a
message.
Details
Benefits:
This function is NA padded by default so it works well with dplyr::mutate() operations.
Difference Calculation
Single differencing, diff_vec(x_t) is equivalent to: x_t - x_t1, where the subscript _t1 indicates the first lag. This transformation can be interpreted as change.
Double Differencing Calculation
Double differencing, diff_vec(x_t, difference = 2) is equivalent to: (x_t - x_t1) - (x_t - x_t1)_t1, where the subscript _t1 indicates the first lag. This transformation can be interpreted as acceleration.
Log Difference Calculation
Log differencing, diff_vec(x_t, log = TRUE) is equivalent to: log(x_t) - log(x_t1) = log(x_t / x_t1), where x_t is the series and x_t1 is the first lag.
The 1st difference diff_vec(difference = 1, log = TRUE) has an interesting property: piping it to exp() yields approximately 1 + rate of change.
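This property can be checked with base R's diff() (a quick sketch; diff_vec() additionally NA-pads the result):

```r
# Verify: exp(log difference) ~ 1 + rate of change
x  <- c(100, 110, 121)       # grows 10% per step
ld <- diff(log(x))           # lag-1 log differences
exp(ld)                      # each element is 1.10
```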
Value
A numeric vector
See Also
Advanced Differencing and Modeling:
Examples
library(dplyr)
# Get Change
1:10 %>% diff_vec()
# Get Acceleration
1:10 %>% diff_vec(difference = 2)
m4_daily %>%
group_by(id) %>%
mutate(difference = diff_vec(value, lag = 1)) %>%
mutate(
difference_inv = diff_inv_vec(
difference,
lag = 1,
# Add initial value to calculate the inverse difference
initial_values = value[1]
)
)
Description
A dataset containing the daily historical stock prices for the "FANG" tech stocks, "FB", "AMZN",
"NFLX", and "GOOG", spanning from the beginning of 2013 through the end of 2016.
Usage
FANG
Format
A "tibble" ("tidy" data frame) with 4,032 rows and 8 variables:
Description
The easiest way to filter time-based start/end ranges using shorthand timeseries notation. See
filter_period() for applying filter expression by period (windows).
Usage
filter_by_time(.data, .date_var, .start_date = "start", .end_date = "end")
Arguments
.data A tibble with a time-based column.
.date_var A column containing date or date-time values to filter. If missing, attempts to
auto-detect date column.
.start_date The starting date for the filter sequence
.end_date The ending date for the filter sequence
Details
Pure Time Series Filtering Flexibility
The .start_date and .end_date parameters are designed with flexibility in mind.
Each side of the time_formula is specified as the character 'YYYY-MM-DD HH:MM:SS', but powerful
shorthand is available. Some examples are:
• Year: .start_date = '2013', .end_date = '2015'
• Month: .start_date = '2013-01', .end_date = '2016-06'
• Day: .start_date = '2013-01-05', .end_date = '2016-06-04'
• Second: .start_date = '2013-01-05 10:22:15', .end_date = '2018-06-03 12:14:22'
• Variations: .start_date = '2013', .end_date = '2016-06'
Key Words: "start" and "end"
Use the keywords "start" and "end" as shorthand, instead of specifying the actual start and end
values. Here are some examples:
• Start of the series to end of 2015: .start_date = 'start', .end_date = '2015'
• Start of 2014 to end of series: .start_date = '2014', .end_date = 'end'
Internal Calculations
All shorthand dates are expanded:
• The .start_date is expanded to be the first date in that period
• The .end_date side is expanded to be the last date in that period
This means that the following examples are equivalent (assuming your index is a POSIXct):
Value
References
See Also
Examples
library(dplyr)
Description
Applies a dplyr filtering expression inside a time-based period (window). See filter_by_time() for filtering continuous ranges defined by start/end dates. filter_period() enables filtering expressions like:
• Filtering to the maximum value each month.
• Filtering the first date each month.
• Filtering all rows with value greater than a monthly average
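For instance, "filtering to the maximum value each month" can be sketched in base R (an illustration of the idea only; filter_period() expresses this with dplyr grammar):

```r
# Base R sketch: keep the row with the maximum value in each month
df <- data.frame(
  date = seq(as.Date("2021-01-01"), as.Date("2021-02-28"), by = "day")
)
df$value <- seq_len(nrow(df))            # strictly increasing toy values

month_key <- format(df$date, "%Y-%m")    # grouping key per month
res <- df[df$value == ave(df$value, month_key, FUN = max), ]
res                                      # last day of Jan and of Feb
```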
Usage
filter_period(.data, ..., .date_var, .period = "1 day")
Arguments
.data A tbl object or data.frame
... Filtering expression. Expressions that return a logical value, and are defined in
terms of the variables in .data. If multiple expressions are included, they are
combined with the & operator. Only rows for which all conditions evaluate to
TRUE are kept.
.date_var A column containing date or date-time values. If missing, attempts to auto-
detect date column.
.period A period to filter within. Time units are grouped using lubridate::floor_date()
or lubridate::ceiling_date().
The value can be:
• second
• minute
• hour
• day
• week
• month
• bimonth
• quarter
• season
• halfyear
• year
Arbitrary unique English abbreviations as in the lubridate::period() con-
structor are allowed:
• "1 year"
• "2 months"
• "30 seconds"
Value
A tibble or data.frame
See Also
Time-Based dplyr functions:
Examples
# Libraries
library(dplyr)
Description
fourier_vec() calculates a Fourier Series from a date or date-time index.
Usage
fourier_vec(x, period, K = 1, type = c("sin", "cos"), scale_factor = NULL)
Arguments
x A date, POSIXct, yearmon, yearqtr, or numeric sequence (scaled to difference 1
for period alignment) to be converted to a fourier series.
period The number of observations that complete one cycle.
K The fourier term order.
type Either "sin" or "cos" for the appropriate type of fourier term.
scale_factor Scale factor is a calculated value that scales date sequences to numeric sequences. A user can provide a different value of scale factor to override the date scaling. Default: NULL (auto-scale).
Details
Benefits:
This function is NA padded by default so it works well with dplyr::mutate() operations.
Fourier Series Calculation
The internal calculation is relatively straightforward: fourier(x) = sin(2 * pi * term * x) or cos(2 * pi * term * x),
where term = K / period.
Period Alignment, period
The period alignment with the sequence is an essential part of fourier series calculation.
• Date, Date-Time, and Zoo (yearqtr and yearmon) Sequences - Are scaled to unit difference
of 1. This happens internally, so there’s nothing you need to do or to worry about. Future time
series will be scaled appropriately.
• Numeric Sequences - Are not scaled, which means you should transform them to a unit
difference of 1 so that your x is a sequence that increases by 1. Otherwise your period and
fourier order will be incorrectly calculated. The solution is to just take your sequence and
divide by the median difference between values.
Fourier Order, K
The fourier order is a parameter that increases the frequency. K = 2 doubles the frequency. It’s
common in time series analysis to add multiple fourier orders (e.g. 1 through 5) to account for
seasonalities that occur faster than the primary seasonality.
Type (Sin/Cos)
The type of the fourier series can be either sin or cos. It’s common in time series analysis to add
both sin and cos series.
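The formula above is easy to reproduce in base R (a sketch of the math only; fourier_vec() also handles the date-to-numeric scaling, and the helper name fourier_term is hypothetical):

```r
# The fourier term formula: sin/cos(2 * pi * term * x), term = K / period
fourier_term <- function(x, period, K = 1, type = c("sin", "cos")) {
  type <- match.arg(type)
  term <- K / period
  if (type == "sin") sin(2 * pi * term * x) else cos(2 * pi * term * x)
}

x <- 0:27                                  # unit-spaced numeric sequence
s <- fourier_term(x, period = 7, K = 1)    # one full cycle every 7 observations
round(s[c(1, 8, 15)], 10)                  # the series repeats at the period
```

With K = 2 the cycle completes twice per period, illustrating how higher fourier orders capture faster seasonalities.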
Value
A numeric vector
See Also
Fourier Modeling Functions:
• step_fourier() - Recipe for tidymodels workflow
• tk_augment_fourier() - Adds many fourier series to a data.frame (tibble)
Additional Vector Functions:
• Fourier Series: fourier_vec()
• Box Cox Transformation: box_cox_vec()
• Lag Transformation: lag_vec()
• Differencing Transformation: diff_vec()
• Rolling Window Transformation: slidify_vec()
• Loess Smoothing Transformation: smooth_vec()
• Missing Value Imputation for Time Series: ts_impute_vec(), ts_clean_vec()
Examples
library(dplyr)
# Set max.print to 50
options_old <- options()$max.print
options(max.print = 50)
# Example (illustrative): a 7-period sine series from a daily date sequence
tk_make_timeseries("2021-01-01", by = "day", length_out = 28) %>%
fourier_vec(period = 7, K = 1, type = "sin")
options(max.print = options_old)
Description
Make future time series from existing
Usage
future_frame(
.data,
.date_var,
.length_out,
.inspect_weekdays = FALSE,
.inspect_months = FALSE,
.skip_values = NULL,
.insert_values = NULL,
.bind_data = FALSE
)
Arguments
.data A data.frame or tibble
.date_var A date or date-time variable.
.length_out Number of future observations. Can be a numeric value or a phrase like "1 year".
.inspect_weekdays
Uses a logistic regression algorithm to inspect whether certain weekdays (e.g.
weekends) should be excluded from the future dates. Default is FALSE.
.inspect_months
Uses a logistic regression algorithm to inspect whether certain days of months
(e.g. last two weeks of year or seasonal days) should be excluded from the future
dates. Default is FALSE.
.skip_values A vector of same class as idx of timeseries values to skip.
.insert_values A vector of same class as idx of timeseries values to insert.
.bind_data Whether or not to perform a row-wise bind of the .data and the future data.
Default: FALSE
Details
This is a wrapper for tk_make_future_timeseries() that works on data.frames. It respects dplyr
groups.
Specifying Length of Future Observations
The argument .length_out determines how many future index observations to compute. It can be specified as a number of observations (e.g. .length_out = 12) or a time-based phrase (e.g. .length_out = "1 year").
• The .inspect_weekdays argument is useful in determining missing days of the week that occur on a weekly frequency such as every week, every other week, and so on. It’s recommended to have at least 60 days to use this option.
• The .inspect_months argument is useful in determining missing days of the month, quarter
or year; however, the algorithm can inadvertently select incorrect dates if the pattern is erratic.
• The .skip_values argument is useful for passing holidays or special index values that should be excluded from the future time series.
• The .insert_values argument is useful for adding values back that the algorithm may have
excluded.
The following two calls are equivalent:
df %>%
future_frame(.length_out = "6 months") %>%
bind_rows(df, .)
df %>%
future_frame(.length_out = "6 months", .bind_data = TRUE)
Value
A tibble that has been extended with future date or date-time timestamps.
See Also
• Making Future Time Series: tk_make_future_timeseries() (Underlying function)
Examples
library(dplyr)
# Define the values to skip (holiday/weekend sequences for the example)
holidays <- tk_make_holiday_sequence("2017-01-01", "2017-12-31", calendar = "NYSE")
weekends <- tk_make_weekend_sequence("2017-01-01", "2017-12-31")
FANG %>%
group_by(symbol) %>%
future_frame(
.length_out = "1 year",
.skip_values = c(holidays, weekends)
)
Description
Usage
is_date_class(x)
Arguments
x A vector to check
Value
Logical (TRUE/FALSE)
Examples
library(dplyr)
Description
Usage
lag_vec(x, lag = 1)
Arguments
x A vector to be lagged.
lag Which lag (how far back) to be included in the differencing calculation. Nega-
tive lags are leads.
Details
Benefits:
This function is NA padded by default so it works well with dplyr::mutate() operations. The
function allows both lags and leads (negative lags).
Lag Calculation
A lag is an offset of lag periods. NA values are returned for the number of lag periods.
Lead Calculation
A negative lag is considered a lead. The only difference between lead_vec() and lag_vec() is
that the lead_vec() function contains a starting negative value.
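The NA-padding behavior can be sketched in base R (an approximation of what lag_vec() does; the helper name lag_pad is hypothetical):

```r
# NA-padded lag/lead sketch: positive lag shifts back, negative lag leads
lag_pad <- function(x, lag = 1) {
  n <- length(x)
  if (lag >= 0) c(rep(NA, lag), x[seq_len(n - lag)])   # lag: NA-pad the front
  else          c(x[seq_len(n + lag) - lag], rep(NA, -lag))  # lead: NA-pad the end
}

lag_pad(1:5, lag = 1)    # NA 1 2 3 4
lag_pad(1:5, lag = -1)   # 2 3 4 5 NA
```

The output keeps the same length as the input, which is why NA-padded lags combine cleanly with dplyr::mutate().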
Value
A numeric vector
See Also
Modeling and Advanced Lagging:
Vectorized Transformations:
Examples
library(dplyr)
# Lag
1:10 %>% lag_vec(lag = 1)
# Lead
1:10 %>% lag_vec(lag = -1)
m4_daily %>%
group_by(id) %>%
mutate(lag_1 = lag_vec(value, lag = 1))
Description
The log_interval_vec() transformation constrains a forecast to an interval specified by an upper_limit
and a lower_limit. The transformation provides similar benefits to log() transformation, while
ensuring the inverted transformation stays within an upper and lower limit.
Usage
log_interval_vec(
x,
limit_lower = "auto",
limit_upper = "auto",
offset = 0,
silent = FALSE
)
Arguments
x A positive numeric vector.
limit_lower A lower limit. Must be less than the minimum value. If set to "auto", selects
zero.
limit_upper An upper limit. Must be greater than the maximum value. If set to "auto", selects
a value that is 10% greater than the maximum value.
offset An offset to include in the log transformation. Useful when the data contains
values less than or equal to zero.
silent Whether or not to report the parameter selections as a message.
Details
Log Interval Transformation
The Log Interval Transformation constrains values to specified upper and lower limits. The trans-
formation maps limits to a function:
log(((x + offset) - a)/(b - (x + offset)))
where a is the lower limit and b is the upper limit.
Inverse Transformation
The inverse transformation:
(b-a)*(exp(x)) / (1 + exp(x)) + a - offset
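The forward and inverse maps above can be sketched directly to verify that they round-trip (offset omitted for brevity; a and b are the lower and upper limits):

```r
# Sketch of the log-interval transformation and its inverse (offset = 0)
log_interval     <- function(x, a, b) log((x - a) / (b - x))
log_interval_inv <- function(y, a, b) (b - a) * exp(y) / (1 + exp(y)) + a

x <- c(1, 5, 9)
y <- log_interval(x, a = 0, b = 10)
log_interval_inv(y, a = 0, b = 10)  # recovers 1 5 9
```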
Value
A numeric vector of the transformed series.
References
• Forecasting: Principles & Practices: Forecasts constrained to an interval
See Also
• Box Cox Transformation: box_cox_vec()
• Lag Transformation: lag_vec()
• Differencing Transformation: diff_vec()
• Rolling Window Transformation: slidify_vec()
• Loess Smoothing Transformation: smooth_vec()
• Fourier Series: fourier_vec()
• Missing Value Imputation & Anomaly Cleaning for Time Series: ts_impute_vec(), ts_clean_vec()
Other common transformations to reduce variance: log(), log1p(), and sqrt()
Examples
library(dplyr)
values_trans_forecast %>%
log_interval_inv_vec(limit_lower = 0, limit_upper = 11) %>%
plot()
Description
The fourth M Competition, M4, started on 1 January 2018 and ended on 31 May 2018. The competition included 100,000 time series datasets. This dataset includes a sample of 4 daily time series from the competition.
Usage
m4_daily
Format
A tibble: 9,743 x 3
• id Factor. Unique series identifier (4 total)
• date Date. Timestamp information. Daily format.
• value Numeric. Value at the corresponding timestamp.
Details
This is a sample of 4 daily data sets from the M4 competition.
Source
• M4 Competition Website
Examples
m4_daily
Description
The fourth M Competition, M4, started on 1 January 2018 and ended on 31 May 2018. The competition included 100,000 time series datasets. This dataset includes a sample of 4 hourly time series from the competition.
Usage
m4_hourly
Format
A tibble: 3,060 x 3
• id Factor. Unique series identifier (4 total)
• date Date-time. Timestamp information. Hourly format.
• value Numeric. Value at the corresponding timestamp.
Details
This is a sample of 4 hourly data sets from the M4 competition.
Source
• M4 Competition Website
Examples
m4_hourly
Description
The fourth M Competition, M4, started on 1 January 2018 and ended on 31 May 2018. The competition included 100,000 time series datasets. This dataset includes a sample of 4 monthly time series from the competition.
Usage
m4_monthly
Format
A tibble: 9,743 x 3
• id Factor. Unique series identifier (4 total)
• date Date. Timestamp information. Monthly format.
• value Numeric. Value at the corresponding timestamp.
Details
This is a sample of 4 monthly data sets from the M4 competition.
Source
• M4 Competition Website
Examples
m4_monthly
Description
The fourth M Competition, M4, started on 1 January 2018 and ended on 31 May 2018. The competition included 100,000 time series datasets. This dataset includes a sample of 4 quarterly time series from the competition.
Usage
m4_quarterly
Format
A tibble: 9,743 x 3
• id Factor. Unique series identifier (4 total)
• date Date. Timestamp information. Quarterly format.
• value Numeric. Value at the corresponding timestamp.
Details
This is a sample of 4 Quarterly data sets from the M4 competition.
Source
• M4 Competition Website
Examples
m4_quarterly
Description
The fourth M Competition, M4, started on 1 January 2018 and ended on 31 May 2018. The competition included 100,000 time series datasets. This dataset includes a sample of 4 weekly time series from the competition.
Usage
m4_weekly
Format
A tibble: 9,743 x 3
• id Factor. Unique series identifier (4 total)
• date Date. Timestamp information. Weekly format.
• value Numeric. Value at the corresponding timestamp.
Details
This is a sample of 4 Weekly data sets from the M4 competition.
Source
• M4 Competition Website
Examples
m4_weekly
Description
The fourth M Competition, M4, started on 1 January 2018 and ended on 31 May 2018. The competition included 100,000 time series datasets. This dataset includes a sample of 4 yearly time series from the competition.
Usage
m4_yearly
Format
A tibble: 9,743 x 3
• id Factor. Unique series identifier (4 total)
• date Date. Timestamp information. Yearly format.
• value Numeric. Value at the corresponding timestamp.
Details
This is a sample of 4 Yearly data sets from the M4 competition.
Source
• M4 Competition Website
Examples
m4_yearly
Description
mutate_by_time() is a time-based variant of the popular dplyr::mutate() function that uses
.date_var to specify a date or date-time column and .by to group the calculation by groups like
"5 seconds", "week", or "3 months".
Usage
mutate_by_time(
.data,
.date_var,
.by = "day",
...,
.type = c("floor", "ceiling", "round")
)
Arguments
.data A tbl object or data.frame
.date_var A column containing date or date-time values to summarize. If missing, attempts to auto-detect the date column.
.by A time unit to summarise by. Time units are collapsed using lubridate::floor_date()
or lubridate::ceiling_date().
The value can be:
• second
• minute
• hour
• day
• week
• month
• bimonth
• quarter
• season
• halfyear
• year
Value
A tibble or data.frame
See Also
Time-Based dplyr functions:
• summarise_by_time() - Easily summarise using a date column.
• mutate_by_time() - Simplifies applying mutations by time windows.
• pad_by_time() - Insert time series rows with regularly spaced timestamps
• filter_by_time() - Quickly filter using date ranges.
• filter_period() - Apply filtering expressions inside periods (windows)
• slice_period() - Apply slice inside periods (windows)
• condense_period() - Convert to a different periodicity
• between_time() - Range detection for date or date-time sequences.
• slidify() - Turn any function into a sliding (rolling) function
Examples
# Libraries
library(dplyr)
tidyr::pivot_longer(value:first_value_by_month) %>%
    plot_time_series(date, value, name,
        .facet_scales = "free", .facet_ncol = 2,
        .smooth = FALSE, .interactive = FALSE)
Description
Normalization is commonly used to center and scale numeric features to prevent one from domi-
nating in algorithms that require data to be on the same scale.
Usage
normalize_vec(x, min = NULL, max = NULL, silent = FALSE)
Arguments
x A numeric vector.
min The population min value in the normalization process.
max The population max value in the normalization process.
silent Whether or not to report the automated min and max parameters as a message.
Details
Standardization vs Normalization
• Standardization refers to a transformation that rescales the data to mean 0, standard deviation 1
• Normalization refers to a transformation that rescales the data to the min-max range: (0, 1)
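Both transformations can be sketched in base R (illustrative only, not the package internals):

```r
# Min-max normalization vs standardization on a toy vector
x <- c(2, 4, 6, 8, 10)

(x - min(x)) / (max(x) - min(x))  # normalize:   0.00 0.25 0.50 0.75 1.00
(x - mean(x)) / sd(x)             # standardize: mean 0, sd 1
```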
Value
A numeric vector with the transformation applied.
See Also
• Normalization/Standardization: standardize_vec(), normalize_vec()
• Box Cox Transformation: box_cox_vec()
• Lag Transformation: lag_vec()
• Differencing Transformation: diff_vec()
• Rolling Window Transformation: slidify_vec()
Examples
library(dplyr)
m4_daily %>%
    group_by(id) %>%
    mutate(value_norm = normalize_vec(value))
Description
The easiest way to fill in missing timestamps or convert to a more granular period (e.g. quarter to
month). Wraps the padr::pad() function for padding tibbles.
Usage
pad_by_time(
.data,
.date_var,
.by = "auto",
.pad_value = NA,
.fill_na_direction = c("none", "down", "up", "downup", "updown"),
.start_date = NULL,
.end_date = NULL
)
Arguments
.data A tibble with a time-based column.
.date_var A column containing date or date-time values to pad
.by Either "auto", a time-based frequency like "year", "month", "day", "hour", etc,
or a time expression like "5 min", or "7 days". See Details.
.pad_value Fills in padded values. Default is NA.
.fill_na_direction
Users can provide an NA fill strategy using tidyr::fill(). Possible values:
'none', 'down', 'up', 'downup', 'updown'. Default: 'none'
.start_date Specifies the start of the padded series. If NULL it will use the lowest value of
the input variable.
.end_date Specifies the end of the padded series. If NULL it will use the highest value of
the input variable.
Details
Padding Missing Observations
The most common use case for pad_by_time() is to add rows where timestamps are missing. This
could be from sales data that have missing values on weekends and holidays. Or it could be high
frequency data where observations are irregularly spaced and should be reset to a regular frequency.
Going from Low to High Frequency
The second use case is going from a low frequency (e.g. day) to high frequency (e.g. hour). This is
possible by supplying a higher frequency to pad_by_time().
Interval, .by
Padding can be applied in the following ways:
• .by = "auto" - pad_by_time() will detect the time-stamp frequency and apply padding.
• The eight intervals available are: year, quarter, month, week, day, hour, min, and sec.
• Intervals like 5 minutes, 6 hours, 10 days are possible.
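What padding does can be sketched with a base R date sequence (pad_by_time() itself wraps padr::pad() and handles this detection automatically):

```r
# Find the timestamps that padding would insert for daily data with gaps
dates    <- as.Date(c("2020-01-01", "2020-01-03", "2020-01-06"))
full_seq <- seq(min(dates), max(dates), by = "day")

setdiff(as.character(full_seq), as.character(dates))
# "2020-01-02" "2020-01-04" "2020-01-05" -- rows pad_by_time() adds with NA values
```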
Value
A tibble or data.frame with rows containing missing timestamps added.
References
• This function wraps the padr::pad() function developed by Edwin Thoen.
See Also
Imputation:
Examples
library(dplyr)
# Detects missing quarter, and pads the missing regularly spaced quarter with NA
missing_data_tbl %>% pad_by_time(date, .by = "quarter")
Description
Significantly faster time series parsing than readr::parse_date(), readr::parse_datetime(), lubridate::as_date(), and lubridate::as_datetime(). Uses the anytime package, which relies on the Boost.Date_Time C++ library for date/datetime parsing.
Usage
parse_date2(x, ..., silent = FALSE)
Arguments
x A character vector
... Additional parameters passed to anytime() and anydate()
silent If TRUE, warns the user of parsing failures.
tz Datetime only. A timezone (see OlsenNames()).
tz_shift Datetime only. If FALSE, forces the datetime into the time zone. If TRUE,
offsets the datetime from UTC to the new time zone.
Details
Parsing Formats
• Date Formats: Must follow a Year, Month, Day sequence. (e.g. parse_date2("2011 June")
is OK, parse_date2("June 2011") is NOT OK).
• Date Time Formats: Must follow a YMD HMS sequence.
Value
Returns a date or datetime vector from the transformation applied to the character timestamp vector.
References
Examples
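A minimal illustration of the YMD ordering requirement described above (assumes the timetk package is attached):

```r
library(timetk)

parse_date2("2011 June")         # Year-then-Month ordering parses
# parse_date2("June 2011")       # Month-first ordering is NOT supported
parse_date2("2016-01-01 15:45")  # date-times follow a YMD HMS sequence
```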
plot_acf_diagnostics Visualize the ACF, PACF, and CCFs for One or More Time Series
Description
Returns the ACF and PACF of a target and optionally CCF’s of one or more lagged predictors
in interactive plotly plots. Scales to multiple time series with group_by().
Usage
plot_acf_diagnostics(
.data,
.date_var,
.value,
.ccf_vars = NULL,
.lags = 1000,
.show_ccf_vars_only = FALSE,
.show_white_noise_bars = TRUE,
.facet_ncol = 1,
.facet_scales = "fixed",
.line_color = "#2c3e50",
.line_size = 0.5,
.line_alpha = 1,
.point_color = "#2c3e50",
.point_size = 1,
.point_alpha = 1,
.x_intercept = NULL,
.x_intercept_color = "#E31A1C",
.hline_color = "#2c3e50",
.white_noise_line_type = 2,
.white_noise_line_color = "#A6CEE3",
.title = "Lag Diagnostics",
.x_lab = "Lag",
.y_lab = "Correlation",
.interactive = TRUE,
.plotly_slider = FALSE
)
Arguments
.data A data frame or tibble with numeric features (values) in descending chronolog-
ical order
.date_var A column containing either date or date-time values
.value A numeric column with a value to have ACF and PACF calculations performed.
.ccf_vars Additional features to perform Lag Cross Correlations (CCFs) versus the .value.
Useful for evaluating external lagged regressors.
.lags A sequence of one or more lags to evaluate.
.show_ccf_vars_only
Hides the ACF and PACF plots so you can focus on only CCFs.
.show_white_noise_bars
Shows the white noise significance bounds.
.facet_ncol Facets: Number of facet columns. Has no effect if using grouped_df.
.facet_scales Facets: Options include "fixed", "free", "free_y", "free_x"
.line_color Line color. Use keyword: "scale_color" to change the color by the facet.
Details
Simplified ACF, PACF, & CCF
We are often interested in all 3 of these functions. Why not get all 3+ at once? Now you can.
Lag Specification
Lags (.lags) can either be specified as:
Unlike other plotting utilities, the .facet_vars argument is NOT included. Use dplyr::group_by() for processing multiple time series groups.
Calculating the White Noise Significance Bars
The formula for the significance bars is +2/sqrt(T) and -2/sqrt(T) where T is the length of the
time series. For a white noise time series, 95% of the data points should fall within this range.
Those that don’t may be significant autocorrelations.
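The bounds can be computed directly from the formula above:

```r
# White noise significance bounds for a series of length T
T_len <- 500
c(lower = -2 / sqrt(T_len), upper = 2 / sqrt(T_len))
# about -0.089 and +0.089: autocorrelations outside this band may be significant
```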
Value
A static ggplot2 plot or an interactive plotly plot
See Also
• Visualizing ACF, PACF, & CCF: plot_acf_diagnostics()
• Visualizing Seasonality: plot_seasonal_diagnostics()
• Visualizing Time Series: plot_time_series()
Examples
library(dplyr)
library(ggplot2)
# Apply Transformations
# - Differencing transformation to identify ARIMA & SARIMA Orders
m4_hourly %>%
    group_by(id) %>%
    plot_acf_diagnostics(
        date, value,       # ACF & PACF
        .lags = "7 days",  # 7-Days of hourly lags
        .interactive = FALSE
    )
# Apply Transformations
# - Differencing transformation to identify ARIMA & SARIMA Orders
m4_hourly %>%
    group_by(id) %>%
    plot_acf_diagnostics(
        date,
        diff_vec(value, lag = 1),  # Difference the value column
        .lags = 0:(24*7),          # 7-Days of hourly lags
        .interactive = FALSE
    ) +
    ggtitle("ACF Diagnostics", subtitle = "1st Difference")
# CCFs Too!
walmart_sales_weekly %>%
    select(id, Date, Weekly_Sales, Temperature, Fuel_Price) %>%
    group_by(id) %>%
    plot_acf_diagnostics(
        Date, Weekly_Sales,  # ACF & PACF
Description
plot_anomalies() is an interactive and scalable function for visualizing anomalies in time series
data. Plots are available in interactive plotly (default) and static ggplot2 format.
plot_anomalies_decomp(): Takes in data from the anomalize() function and returns a plot of the anomaly decomposition. Useful for interpreting how the anomalize() function determines outliers from the "remainder".
plot_anomalies_cleaned() helps users visualize the before/after of cleaning anomalies.
Usage
plot_anomalies(
.data,
.date_var,
.facet_vars = NULL,
.facet_ncol = 1,
.facet_nrow = 1,
.facet_scales = "free",
.facet_dir = "h",
.facet_collapse = FALSE,
.facet_collapse_sep = " ",
.facet_strip_remove = FALSE,
.line_color = "#2c3e50",
.line_size = 0.5,
.line_type = 1,
.line_alpha = 1,
.anom_color = "#e31a1c",
.anom_alpha = 1,
.anom_size = 1.5,
.ribbon_fill = "grey20",
.ribbon_alpha = 0.2,
.legend_show = TRUE,
.title = "Anomaly Plot",
.x_lab = "",
.y_lab = "",
.color_lab = "Anomaly",
.interactive = TRUE,
.trelliscope = FALSE,
.trelliscope_params = list()
)
plot_anomalies_decomp(
.data,
.date_var,
.facet_vars = NULL,
.facet_scales = "free",
.line_color = "#2c3e50",
.line_size = 0.5,
.line_type = 1,
.line_alpha = 1,
.title = "Anomaly Decomposition Plot",
.x_lab = "",
.y_lab = "",
.interactive = TRUE
)
plot_anomalies_cleaned(
.data,
.date_var,
.facet_vars = NULL,
.facet_ncol = 1,
.facet_nrow = 1,
.facet_scales = "free",
.facet_dir = "h",
.facet_collapse = FALSE,
.facet_collapse_sep = " ",
.facet_strip_remove = FALSE,
.line_color = "#2c3e50",
.line_size = 0.5,
.line_type = 1,
.line_alpha = 1,
.cleaned_line_color = "#e31a1c",
.cleaned_line_size = 0.5,
.cleaned_line_type = 1,
.cleaned_line_alpha = 1,
.legend_show = TRUE,
.title = "Anomalies Cleaned Plot",
.x_lab = "",
.y_lab = "",
.color_lab = "Legend",
.interactive = TRUE,
.trelliscope = FALSE,
.trelliscope_params = list()
)
Arguments
.data A tibble or data.frame that has been anomalized by anomalize()
.date_var A column containing either date or date-time values
.facet_vars One or more grouping columns that are broken out into ggplot2 facets. These can be selected using tidyselect() helpers (e.g. contains()).
.facet_ncol Number of facet columns.
.facet_nrow Number of facet rows (only used for .trelliscope = TRUE)
.facet_scales Control facet x & y-axis ranges. Options include "fixed", "free", "free_y",
"free_x"
.facet_dir The direction of faceting ("h" for horizontal, "v" for vertical). Default is "h".
.facet_collapse
Multiple facets included on one facet strip instead of multiple facet strips.
.facet_collapse_sep
The separator used for collapsing facets.
.facet_strip_remove
Whether or not to remove the strip and text label for each facet.
.line_color Line color.
.line_size Line size.
.line_type Line type.
.line_alpha Line alpha (opacity). Range: (0, 1).
.anom_color Color for the anomaly dots
.anom_alpha Opacity for the anomaly dots. Range: (0, 1).
.anom_size Size for the anomaly dots
.ribbon_fill Fill color for the acceptable range
.ribbon_alpha Fill opacity for the acceptable range. Range: (0, 1).
.legend_show Toggles on/off the Legend
.title Plot title.
.x_lab Plot x-axis label
.y_lab Plot y-axis label
.color_lab Plot label for the color legend
.interactive If TRUE, returns a plotly interactive plot. If FALSE, returns a static ggplot2
plot.
.trelliscope Returns either a normal plot or a trelliscopejs plot (great for many time series)
Must have trelliscopejs installed.
.trelliscope_params
Pass parameters to the trelliscopejs::facet_trelliscope() function as a
list(). The only parameters that cannot be passed are:
• ncol: use .facet_ncol
• nrow: use .facet_nrow
• scales: use .facet_scales
• as_plotly: use .interactive
Value
A plotly or ggplot2 visualization
Examples
# Plot Anomalies
library(dplyr)
walmart_sales_weekly %>%
    filter(id %in% c("1_1", "1_3")) %>%
    group_by(id) %>%
    anomalize(Date, Weekly_Sales) %>%
    plot_anomalies(Date, .facet_ncol = 2, .ribbon_alpha = 0.25, .interactive = FALSE)

walmart_sales_weekly %>%
    filter(id %in% c("1_1", "1_3")) %>%
    group_by(id) %>%
    anomalize(Date, Weekly_Sales, .message = FALSE) %>%
    plot_anomalies_decomp(Date, .interactive = FALSE)

walmart_sales_weekly %>%
    filter(id %in% c("1_1", "1_3")) %>%
    group_by(id) %>%
    anomalize(Date, Weekly_Sales, .message = FALSE) %>%
    plot_anomalies_cleaned(Date, .facet_ncol = 2, .interactive = FALSE)
plot_anomaly_diagnostics
Visualize Anomalies for One or More Time Series
Description
An interactive and scalable function for visualizing anomalies in time series data. Plots are available
in interactive plotly (default) and static ggplot2 format.
Usage
plot_anomaly_diagnostics(
.data,
.date_var,
.value,
.facet_vars = NULL,
.frequency = "auto",
.trend = "auto",
.alpha = 0.05,
.max_anomalies = 0.2,
.message = TRUE,
.facet_ncol = 1,
.facet_nrow = 1,
.facet_scales = "free",
.facet_dir = "h",
.facet_collapse = FALSE,
.facet_collapse_sep = " ",
.facet_strip_remove = FALSE,
.line_color = "#2c3e50",
.line_size = 0.5,
.line_type = 1,
.line_alpha = 1,
.anom_color = "#e31a1c",
.anom_alpha = 1,
.anom_size = 1.5,
.ribbon_fill = "grey20",
.ribbon_alpha = 0.2,
.legend_show = TRUE,
.title = "Anomaly Diagnostics",
.x_lab = "",
.y_lab = "",
.color_lab = "Anomaly",
.interactive = TRUE,
.trelliscope = FALSE,
.trelliscope_params = list()
)
Arguments
.data A tibble or data.frame with a time-based column
.date_var A column containing either date or date-time values
.value A column containing numeric values
.facet_vars One or more grouping columns that are broken out into ggplot2 facets. These can be selected using tidyselect() helpers (e.g. contains()).
.frequency Controls the seasonal adjustment (removal of seasonality). Input can be either
"auto", a time-based definition (e.g. "2 weeks"), or a numeric number of obser-
vations per frequency (e.g. 10). Refer to tk_get_frequency().
.trend Controls the trend component. For STL, trend controls the sensitivity of the
LOESS smoother, which is used to remove the remainder. Refer to tk_get_trend().
.alpha Controls the width of the "normal" range. Lower values are more conservative
while higher values are less prone to incorrectly classifying "normal" observa-
tions.
.max_anomalies The maximum percent of anomalies permitted to be identified.
.message A boolean. If TRUE, will output information related to automatic frequency and
trend selection (if applicable).
.facet_ncol Number of facet columns.
.facet_nrow Number of facet rows (only used for .trelliscope = TRUE)
.facet_scales Control facet x & y-axis ranges. Options include "fixed", "free", "free_y",
"free_x"
.facet_dir The direction of faceting ("h" for horizontal, "v" for vertical). Default is "h".
.facet_collapse
Multiple facets included on one facet strip instead of multiple facet strips.
.facet_collapse_sep
The separator used for collapsing facets.
.facet_strip_remove
Whether or not to remove the strip and text label for each facet.
.line_color Line color.
.line_size Line size.
.line_type Line type.
.line_alpha Line alpha (opacity). Range: (0, 1).
.anom_color Color for the anomaly dots
.anom_alpha Opacity for the anomaly dots. Range: (0, 1).
.anom_size Size for the anomaly dots
.ribbon_fill Fill color for the acceptable range
.ribbon_alpha Fill opacity for the acceptable range. Range: (0, 1).
.legend_show Toggles on/off the Legend
.title Plot title.
.x_lab Plot x-axis label
.y_lab Plot y-axis label
.color_lab Plot label for the color legend
.interactive If TRUE, returns a plotly interactive plot. If FALSE, returns a static ggplot2
plot.
.trelliscope Returns either a normal plot or a trelliscopejs plot (great for many time series)
Must have trelliscopejs installed.
.trelliscope_params
Pass parameters to the trelliscopejs::facet_trelliscope() function as a
list(). The only parameters that cannot be passed are:
• ncol: use .facet_ncol
• nrow: use .facet_nrow
• scales: use .facet_scales
• as_plotly: use .interactive
Details
plot_anomaly_diagnostics() is a visualization wrapper for tk_anomaly_diagnostics(), which performs group-wise anomaly detection using a 2-step process to detect outliers in time series.
Step 1: Detrend & Remove Seasonality using STL Decomposition
The decomposition separates the "season" and "trend" components from the "observed" values leav-
ing the "remainder" for anomaly detection.
The user can control two parameters: frequency and trend.
1. .frequency: Adjusts the "season" component that is removed from the "observed" values.
2. .trend: Adjusts the trend window (the t.window parameter from stats::stl()) that is used.
The user may supply both .frequency and .trend as time-based durations (e.g. "6 weeks") or
numeric values (e.g. 180) or "auto", which predetermines the frequency and/or trend based on the
scale of the time series using the tk_time_scale_template().
Step 2: Anomaly Detection
Once "trend" and "season" (seasonality) is removed, anomaly detection is performed on the "re-
mainder". Anomalies are identified, and boundaries (recomposed_l1 and recomposed_l2) are de-
termined.
The Anomaly Detection Method uses an interquartile range (IQR) spanning the 25th to 75th percentiles around the median.
IQR Adjustment, alpha parameter
With the default alpha = 0.05, the limits are established by expanding the 25/75 baseline by an
IQR Factor of 3 (3X). The IQR Factor = 0.15 / alpha (hence 3X with alpha = 0.05):
• To increase the IQR Factor controlling the limits, decrease the alpha, which makes it more
difficult to be an outlier.
• Increase alpha to make it easier to be an outlier.
• The IQR outlier detection method is used in forecast::tsoutliers().
• A similar outlier detection method is used by Twitter’s AnomalyDetection package.
• Both Twitter and Forecast tsoutliers methods have been implemented in Business Science’s
anomalize package.
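The limit mechanics described above can be sketched as follows (an illustrative approximation, not the exact anomalize() internals):

```r
# Sketch of IQR-based limits on the STL remainder (IQR Factor = 0.15 / alpha)
set.seed(123)
remainder  <- c(rnorm(100), 8)            # one injected outlier
alpha      <- 0.05
iqr_factor <- 0.15 / alpha                # 3X at the default alpha = 0.05
q          <- quantile(remainder, c(0.25, 0.75))
limits     <- c(q[1], q[2]) + c(-1, 1) * iqr_factor * diff(q)

remainder[remainder < limits[1] | remainder > limits[2]]  # flags the injected outlier
```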
Value
A plotly or ggplot2 visualization
References
1. CLEVELAND, R. B., CLEVELAND, W. S., MCRAE, J. E., AND TERPENNING, I. STL: A
Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics, Vol.
6, No. 1 (1990), pp. 3-73.
2. Owen S. Vallis, Jordan Hochenbaum and Arun Kejariwal (2014). A Novel Technique for
Long-Term Anomaly Detection in the Cloud. Twitter Inc.
See Also
• tk_anomaly_diagnostics(): Group-wise anomaly detection
Examples
library(dplyr)
walmart_sales_weekly %>%
    group_by(id) %>%
    plot_anomaly_diagnostics(Date, Weekly_Sales,
        .message = FALSE,
        .facet_ncol = 3,
        .ribbon_alpha = 0.25,
        .interactive = FALSE)
plot_seasonal_diagnostics
Visualize Multiple Seasonality Features for One or More Time Series
Description
An interactive and scalable function for visualizing time series seasonality. Plots are available in
interactive plotly (default) and static ggplot2 format.
Usage
plot_seasonal_diagnostics(
.data,
.date_var,
.value,
.facet_vars = NULL,
.feature_set = "auto",
.geom = c("boxplot", "violin"),
.geom_color = "#2c3e50",
.geom_outlier_color = "#2c3e50",
.title = "Seasonal Diagnostics",
.x_lab = "",
.y_lab = "",
.interactive = TRUE
)
Arguments
.data A tibble or data.frame with a time-based column
.date_var A column containing either date or date-time values
.value A column containing numeric values
.facet_vars One or more grouping columns that are broken out into ggplot2 facets. These can be selected using tidyselect() helpers (e.g. contains()).
.feature_set One or multiple selections to analyze for seasonality. Choices include:
• "auto" - Automatically selects features based on the time stamps and length
of the series.
• "second" - Good for analyzing seasonality by second of each minute.
• "minute" - Good for analyzing seasonality by minute of the hour
• "hour" - Good for analyzing seasonality by hour of the day
• "wday.lbl" - Labeled weekdays. Good for analyzing seasonality by day of
the week.
• "week" - Good for analyzing seasonality by week of the year.
• "month.lbl" - Labeled months. Good for analyzing seasonality by month of
the year.
• "quarter" - Good for analyzing seasonality by quarter of the year
• "year" - Good for analyzing seasonality over multiple years.
.geom Either "boxplot" or "violin"
.geom_color Geometry color. Line color. Use keyword: "scale_color" to change the color by
the facet.
.geom_outlier_color
Color used to highlight outliers.
.title Plot title.
.x_lab Plot x-axis label
.y_lab Plot y-axis label
.interactive If TRUE, returns a plotly interactive plot. If FALSE, returns a static ggplot2
plot.
Details
Automatic Feature Selection
Internal calculations are performed to detect a sub-range of features to include using the following logic:
• The minimum feature is selected based on the median difference between consecutive times-
tamps
Example: Hourly timestamp data that lasts more than 2 weeks will have the following features:
"hour", "wday.lbl", and "week".
Scalable with Grouped Data Frames
This function respects grouped data.frame and tibbles that were made with dplyr::group_by().
For grouped data, the automatic feature selection returned is a collection of all features within the
sub-groups. This means extra features are returned even though they may be meaningless for some
of the groups.
Transformations
The .value parameter respects transformations (e.g. .value = log(sales)).
Value
A plotly or ggplot2 visualization
Examples
library(dplyr)
# Visualize series
taylor_30_min %>%
    plot_time_series(date, value, .interactive = FALSE)

# Visualize seasonality
taylor_30_min %>%
    plot_seasonal_diagnostics(date, value, .interactive = FALSE)

# Visualize series
m4_hourly %>%
    group_by(id) %>%
    plot_time_series(date, value, .facet_scales = "free", .interactive = FALSE)

# Visualize seasonality
m4_hourly %>%
    group_by(id) %>%
    plot_seasonal_diagnostics(date, value, .interactive = FALSE)
plot_stl_diagnostics Visualize STL Decomposition Features for One or More Time Series
Description
An interactive and scalable function for visualizing time series STL Decomposition. Plots are
available in interactive plotly (default) and static ggplot2 format.
Usage
plot_stl_diagnostics(
.data,
.date_var,
.value,
.facet_vars = NULL,
.feature_set = c("observed", "season", "trend", "remainder", "seasadj"),
.frequency = "auto",
.trend = "auto",
.message = TRUE,
.facet_scales = "free",
.line_color = "#2c3e50",
.line_size = 0.5,
.line_type = 1,
.line_alpha = 1,
.title = "STL Diagnostics",
.x_lab = "",
.y_lab = "",
.interactive = TRUE
)
Arguments
.data A tibble or data.frame with a time-based column
.date_var A column containing either date or date-time values
.value A column containing numeric values
.facet_vars One or more grouping columns that are broken out into ggplot2 facets. These can be selected using tidyselect() helpers (e.g. contains()).
.feature_set The STL decompositions to visualize. Select one or more of "observed", "sea-
son", "trend", "remainder", "seasadj".
.frequency Controls the seasonal adjustment (removal of seasonality). Input can be either
"auto", a time-based definition (e.g. "2 weeks"), or a numeric number of obser-
vations per frequency (e.g. 10). Refer to tk_get_frequency().
.trend Controls the trend component. For STL, trend controls the sensitivity of the LOESS smoother, which is used to remove the remainder.
.message A boolean. If TRUE, will output information related to automatic frequency and
trend selection (if applicable).
.facet_scales Control facet x & y-axis ranges. Options include "fixed", "free", "free_y",
"free_x"
.line_color Line color.
.line_size Line size.
.line_type Line type.
.line_alpha Line alpha (opacity). Range: (0, 1).
.title Plot title.
.x_lab Plot x-axis label
.y_lab Plot y-axis label
.interactive If TRUE, returns a plotly interactive plot. If FALSE, returns a static ggplot2
plot.
Details
The plot_stl_diagnostics() function generates a Seasonal-Trend-Loess decomposition. The
function is "tidy" in the sense that it works on data frames and is designed to work with dplyr
groups.
STL method:
The STL method implements time series decomposition using the underlying stats::stl(). The
decomposition separates the "season" and "trend" components from the "observed" values leaving
the "remainder".
Frequency & Trend Selection
The user can control two parameters: .frequency and .trend.
1. The .frequency parameter adjusts the "season" component that is removed from the "ob-
served" values.
2. The .trend parameter adjusts the trend window (t.window parameter from stl()) that is
used.
The user may supply both .frequency and .trend as time-based durations (e.g. "6 weeks") or
numeric values (e.g. 180) or "auto", which automatically selects the frequency and/or trend based
on the scale of the time series.
Value
A plotly or ggplot2 visualization
Examples
library(dplyr)
plot_stl_diagnostics(
    date, value,
    # Set features to return, desired frequency and trend
    .feature_set = c("observed", "season", "trend", "remainder"),
    .frequency = "24 hours",
    .trend = "1 week",
    .interactive = FALSE)
Description
A workhorse time-series plotting function that generates interactive plotly plots, consolidates 20+
lines of ggplot2 code, and scales well to many time series.
Usage
plot_time_series(
.data,
.date_var,
.value,
.color_var = NULL,
.facet_vars = NULL,
.facet_ncol = 1,
.facet_nrow = 1,
.facet_scales = "free_y",
.facet_dir = "h",
.facet_collapse = FALSE,
.facet_collapse_sep = " ",
.facet_strip_remove = FALSE,
.line_color = "#2c3e50",
.line_size = 0.5,
.line_type = 1,
.line_alpha = 1,
.y_intercept = NULL,
.y_intercept_color = "#2c3e50",
.x_intercept = NULL,
.x_intercept_color = "#2c3e50",
.smooth = TRUE,
.smooth_period = "auto",
.smooth_message = FALSE,
.smooth_span = NULL,
.smooth_degree = 2,
.smooth_color = "#3366FF",
.smooth_size = 1,
.smooth_alpha = 1,
.legend_show = TRUE,
.title = "Time Series Plot",
.x_lab = "",
.y_lab = "",
.color_lab = "Legend",
.interactive = TRUE,
.plotly_slider = FALSE,
.trelliscope = FALSE,
.trelliscope_params = list()
)
Arguments
.data A tibble or data.frame with a time-based column
.date_var A column containing either date or date-time values
.value A column containing numeric values
.color_var A categorical column that can be used to change the line color
.facet_vars One or more grouping columns that are broken out into ggplot2 facets. These can be selected using tidyselect() helpers (e.g. contains()).
.facet_ncol Number of facet columns.
.facet_nrow Number of facet rows (only used for .trelliscope = TRUE)
.facet_scales Control facet x & y-axis ranges. Options include "fixed", "free", "free_y",
"free_x"
.facet_dir The direction of faceting ("h" for horizontal, "v" for vertical). Default is "h".
.facet_collapse
Multiple facets included on one facet strip instead of multiple facet strips.
.facet_collapse_sep
The separator used for collapsing facets.
.facet_strip_remove
Whether or not to remove the strip and text label for each facet.
.line_color Line color. Overridden if .color_var is specified.
.line_size Line size.
.line_type Line type.
Details
plot_time_series() is a scalable function that works with both ungrouped and grouped data.frame
objects (and tibbles!).
Interactive by Default
plot_time_series() is built for exploration using:
• Interactive Plots: plotly (default) - Great for exploring!
• Static Plots: ggplot2 (set .interactive = FALSE) - Great for PDF Reports
Smoother Period / Span Calculation
By default, the .smooth_period is automatically calculated using 75% of the observations. This
is the same as geom_smooth(method = "loess", span = 0.75).
A user can specify a time-based window (e.g. .smooth_period = "1 year") or a numeric value
(e.g. .smooth_period = 365).
Time-based windows return the median number of observations in a window using tk_get_trend().
Value
A static ggplot2 plot or an interactive plotly plot
Examples
library(dplyr)
library(lubridate)
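A minimal usage sketch, assuming the m4_daily grouped dataset that ships with timetk (documented earlier in this manual):

```r
# Facets are automatically created for each dplyr group
m4_daily %>%
  group_by(id) %>%
  plot_time_series(
    date, value,
    .facet_ncol  = 2,      # 2 facet columns
    .interactive = FALSE   # return a static ggplot2 plot
  )
```

With .color_var or additional .facet_vars, the same call scales to many time series.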
plot_time_series_boxplot
Interactive Time Series Box Plots
Description
A boxplot function that generates interactive plotly plots for time series.
Usage
plot_time_series_boxplot(
.data,
.date_var,
.value,
.period,
.color_var = NULL,
.facet_vars = NULL,
.facet_ncol = 1,
.facet_nrow = 1,
.facet_scales = "free_y",
.facet_dir = "h",
.facet_collapse = FALSE,
.facet_collapse_sep = " ",
.facet_strip_remove = FALSE,
.line_color = "#2c3e50",
.line_size = 0.5,
.line_type = 1,
.line_alpha = 1,
.y_intercept = NULL,
.y_intercept_color = "#2c3e50",
.smooth = TRUE,
.smooth_func = ~mean(.x, na.rm = TRUE),
.smooth_period = "auto",
.smooth_message = FALSE,
.smooth_span = NULL,
.smooth_degree = 2,
.smooth_color = "#3366FF",
.smooth_size = 1,
.smooth_alpha = 1,
.legend_show = TRUE,
.title = "Time Series Plot",
.x_lab = "",
.y_lab = "",
.color_lab = "Legend",
.interactive = TRUE,
.plotly_slider = FALSE,
.trelliscope = FALSE,
.trelliscope_params = list()
)
Arguments
.data A tibble or data.frame with a time-based column
.date_var A column containing either date or date-time values
.value A column containing numeric values
.period A time series unit of aggregation for the boxplot. Examples include:
• "1 week"
• "3 years"
• "30 minutes"
.color_var A categorical column that can be used to change the line color
.facet_vars One or more grouping columns that are broken out into ggplot2 facets. These can
be selected using tidyselect() helpers (e.g. contains()).
.facet_ncol Number of facet columns.
Details
plot_time_series_boxplot() is a scalable function that works with both ungrouped and grouped
data.frame objects (and tibbles!).
Interactive by Default
plot_time_series_boxplot() is built for exploration using:
• Interactive Plots: plotly (default) - Great for exploring!
• Static Plots: ggplot2 (set .interactive = FALSE) - Great for PDF Reports
By default, an interactive plotly visualization is returned.
Scalable with Facets & Dplyr Groups
plot_time_series_boxplot() returns multiple time series plots using ggplot2 facets:
• group_by() - If groups are detected, multiple facets are returned
• plot_time_series_boxplot(.facet_vars) - You can manually supply facets as well.
Can Transform Values just like ggplot
The .value argument accepts transformations just like ggplot2. For example, if you want to take
the log of sales you can use a call like plot_time_series_boxplot(date, log(sales)) and the
log transformation will be applied.
Smoother Period / Span Calculation
The .smooth = TRUE option returns a smoother that is calculated based on either:
1. A .smooth_func: The method of aggregation. Usually an aggregation like mean is used. The
purrr-style function syntax can be used to apply complex functions.
2. A .smooth_period: Number of observations
3. A .smooth_span: A percentage of observations
By default, the .smooth_period is automatically calculated using 75% of the observations. This
is the same as geom_smooth(method = "loess", span = 0.75).
A user can specify a time-based window (e.g. .smooth_period = "1 year") or a numeric value
(e.g. .smooth_period = 365).
Time-based windows return the median number of observations in a window using tk_get_trend().
Value
A static ggplot2 plot or an interactive plotly plot
Examples
library(dplyr)
library(lubridate)
# A sketch using the m4_monthly dataset that ships with timetk
m4_monthly %>%
group_by(id) %>%
plot_time_series_boxplot(
date, value,
.period = "1 year",
.facet_ncol = 2,
.interactive = FALSE)
plot_time_series_cv_plan
Visualize a Time Series Resample Plan
Description
The plot_time_series_cv_plan() function provides a visualization for a time series resample
specification (rset) of either rolling_origin or time_series_cv class.
Usage
plot_time_series_cv_plan(
.data,
.date_var,
.value,
...,
.smooth = FALSE,
.title = "Time Series Cross Validation Plan"
)
Arguments
.data A time series resample specification of either rolling_origin or time_series_cv
class or a data frame (tibble) that has been prepared using tk_time_series_cv_plan().
.date_var A column containing either date or date-time values
.value A column containing numeric values
... Additional parameters passed to plot_time_series()
.smooth Logical - Whether or not to include a trendline smoother. Uses smooth_vec()
to apply a LOESS smoother.
.title Title for the plot
Details
Resample Set
A resample set is an output of the timetk::time_series_cv() function or the rsample::rolling_origin()
function.
Value
Returns a static ggplot or interactive plotly object depending on whether or not .interactive is
FALSE or TRUE, respectively.
See Also
• time_series_cv() and rsample::rolling_origin() - Functions used to create time series
resample specifications.
• plot_time_series_cv_plan() - The plotting function used for visualizing the time series
resample plan.
Examples
library(dplyr)
library(rsample)
# `resample_spec` is assumed to be an rset created with time_series_cv()
# or rsample::rolling_origin(), e.g.:
# resample_spec <- time_series_cv(FB_tbl, initial = "1 year", assess = "3 months")
resample_spec %>%
tk_time_series_cv_plan() %>%
plot_time_series_cv_plan(
date, adjusted, # date variable and value variable
# Additional arguments passed to plot_time_series(),
.facet_ncol = 2,
.line_alpha = 0.5,
.interactive = FALSE
)
plot_time_series_regression
Visualize a Time Series Linear Regression Formula
Description
A wrapper for stats::lm() that overlays a linear regression fitted model over a time series, which
can help show the effect of feature engineering.
Usage
plot_time_series_regression(
.data,
.date_var,
.formula,
.show_summary = FALSE,
...
)
Arguments
.data A tibble or data.frame with a time-based column
.date_var A column containing either date or date-time values
.formula A linear regression formula. The left-hand side of the formula is used as the
y-axis value. The right-hand side of the formula is used to develop the linear
regression model. See stats::lm() for details.
.show_summary If TRUE, prints the summary.lm().
... Additional arguments passed to plot_time_series()
Details
plot_time_series_regression() is a scalable function that works with both ungrouped and
grouped data.frame objects (and tibbles!).
Time Series Formula
The .formula uses stats::lm() to apply a linear regression, which is used to visualize the effect
of feature engineering on a time series.
Interactive by Default
plot_time_series_regression() is built for exploration using:
• Interactive Plots: plotly (default) - Great for exploring!
• Static Plots: ggplot2 (set .interactive = FALSE) - Great for PDF Reports
Value
A static ggplot2 plot or an interactive plotly plot
Examples
library(dplyr)
library(lubridate)
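A minimal sketch, assuming the m4_monthly dataset that ships with timetk; month() comes from lubridate, loaded above:

```r
# Overlay a linear regression with calendar features to visualize
# the effect of feature engineering
m4_monthly %>%
  group_by(id) %>%
  plot_time_series_regression(
    .date_var    = date,
    .formula     = log(value) ~ as.numeric(date) + month(date, label = TRUE),
    .facet_ncol  = 2,
    .interactive = FALSE
  )
```

Setting .show_summary = TRUE would additionally print the summary.lm() output for inspection.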
set_tk_time_scale_template
Get and modify the Time Scale Template
Description
Get and modify the Time Scale Template
Usage
set_tk_time_scale_template(.data)
get_tk_time_scale_template()
tk_time_scale_template()
Arguments
Details
Used to get and set the time scale template, which is used by tk_get_frequency() and tk_get_trend()
when period = "auto".
The predefined template is stored in a function tk_time_scale_template(). This is the default
used by timetk.
Changing the Default Template
Value
See Also
Examples
get_tk_time_scale_template()
set_tk_time_scale_template(tk_time_scale_template())
slice_period
Apply slice inside periods (windows)
Description
Applies a dplyr slice inside a time-based period (window).
Usage
slice_period(.data, ..., .date_var, .period = "1 day")
Arguments
.data A tbl object or data.frame
... For slice(): <data-masking> Integer row values.
Provide either positive values to keep, or negative values to drop. The values
provided must be either all positive or all negative. Indices beyond the number
of rows in the input are silently ignored.
For slice_*(), these arguments are passed on to methods.
.date_var A column containing date or date-time values. If missing, attempts to auto-
detect date column.
.period A period to slice within. Time units are grouped using lubridate::floor_date()
or lubridate::ceiling_date().
The value can be:
• second
• minute
• hour
• day
• week
• month
• bimonth
• quarter
• season
• halfyear
• year
Arbitrary unique English abbreviations as in the lubridate::period() con-
structor are allowed:
• "1 year"
• "2 months"
• "30 seconds"
Value
A tibble or data.frame
See Also
Time-Based dplyr functions:
• summarise_by_time() - Easily summarise using a date column.
• mutate_by_time() - Simplifies applying mutations by time windows.
• pad_by_time() - Insert time series rows with regularly spaced timestamps
• filter_by_time() - Quickly filter using date ranges.
• filter_period() - Apply filtering expressions inside periods (windows)
• slice_period() - Apply slice inside periods (windows)
• condense_period() - Convert to a different periodicity
• between_time() - Range detection for date or date-time sequences.
• slidify() - Turn any function into a sliding (rolling) function
Examples
# Libraries
library(dplyr)
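A minimal sketch, assuming the m4_daily dataset that ships with timetk:

```r
# Keep the first 2 rows inside each 1-month period, by group
m4_daily %>%
  group_by(id) %>%
  slice_period(1:2, .date_var = date, .period = "1 month")
```

Negative indices work the same way, e.g. slice_period(-1, .period = "1 week") drops the first row of each week.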
slidify
Turn any function into a sliding (rolling) function
Description
slidify returns a rolling (sliding) version of the input function, with a rolling (sliding) .period
specified by the user.
Usage
slidify(
.f,
.period = 1,
.align = c("center", "left", "right"),
.partial = FALSE,
.unlist = TRUE
)
Arguments
.f A function, formula, or vector (not necessarily atomic).
If a function, it is used as is.
If a formula, e.g. ~ .x + 2, it is converted to a function. There are three ways to
refer to the arguments:
• For a single argument function, use .
• For a two argument function, use .x and .y
• For more arguments, use ..1, ..2, ..3 etc
This syntax allows you to create very compact anonymous functions. Note that
formula functions conceptually take dots (that’s why you can use ..1 etc). They
silently ignore additional arguments that are not used in the formula expression.
If character vector, numeric vector, or list, it is converted to an extractor func-
tion. Character vectors index by name and numeric vectors index by position;
use a list to index by position and name at different levels. If a component is not
present, the value of .default will be returned.
.period The period size to roll over
.align One of "center", "left" or "right".
.partial Should the moving window be allowed to return partial (incomplete) windows
instead of NA values. Set to FALSE by default, but can be switched to TRUE to
remove NA’s.
.unlist If the function returns a single value each time it is called, use .unlist = TRUE.
If the function returns more than one value, or a more complicated object (like a
linear model), use .unlist = FALSE to create a list-column of the rolling results.
Details
The slidify() function is almost identical to tibbletime::rollify() with 3 improvements:
1. Alignment ("center", "left", "right")
2. Partial windows are allowed
3. Uses slider under the hood, which improves speed and reliability by implementing code at
C++ level
Make any function a Sliding (Rolling) Function
slidify() turns a function into a sliding version of itself for use inside of a call to dplyr::mutate(),
however it works equally as well when called from purrr::map().
Because of its intended use with dplyr::mutate(), slidify creates a function that always returns
output with the same length as the input.
Alignment
Rolling / Sliding functions generate .period - 1 fewer values than the incoming vector. Thus, the
vector needs to be aligned. Alignment of the vector follows 3 types:
• center (default): NA or .partial values are divided and added to the beginning and end of
the series to "Center" the moving average. This is common in Time Series applications (e.g.
denoising).
• left: NA or .partial values are added to the end to shift the series to the Left.
• right: NA or .partial values are added to the beginning to shift the series to the Right. This
is common in Financial Applications (e.g moving average cross-overs).
Value
A function with the rolling/sliding conversion applied.
References
• The Tibbletime R Package by Davis Vaughan, which includes the original rollify() function.
See Also
Transformation Functions:
Slider R Package:
Examples
library(dplyr)
# Turn the normal mean function into a rolling mean with a 5 row .period
# FB is the Facebook subset of the FANG dataset included with timetk
FB <- FANG %>% filter(symbol == "FB")
mean_roll_5 <- slidify(mean, .period = 5, .align = "right")
FB %>%
mutate(rolling_mean_5 = mean_roll_5(adjusted))
# Use `.partial = TRUE` to allow partial windows (those with less than the full .period)
mean_roll_5_partial <- slidify(mean, .period = 5, .align = "right", .partial = TRUE)
FB %>%
mutate(rolling_mean_5 = mean_roll_5_partial(adjusted))
# There's nothing stopping you from combining multiple rolling functions with
# different .period sizes in the same mutate call
mean_roll_10 <- slidify(mean, .period = 10, .align = "right")
FB %>%
select(symbol, date, adjusted) %>%
mutate(
rolling_mean_5 = mean_roll_5(adjusted),
rolling_mean_10 = mean_roll_10(adjusted)
)
FB %>%
select(symbol, date, adjusted) %>%
tk_augment_slidify(
adjusted, .period = 5:10, .f = mean, .align = "right",
.names = stringr::str_c("MA_", 5:10)
)
# One of the most powerful things about this is that it works with
# groups since `mutate` is being used
mean_roll_3 <- slidify(mean, .period = 3, .align = "right")
FANG %>%
group_by(symbol) %>%
mutate(mean_roll = mean_roll_3(adjusted)) %>%
slice(1:5)
cor_roll <- slidify(~cor(.x, .y), .period = 5, .align = "right")
FB %>%
mutate(running_cor = cor_roll(adjusted, open))
# With >2 args, create an anonymous function with >2 args or use
# the purrr convention of ..1, ..2, ..3 to refer to the arguments
avg_of_avgs <- slidify(
function(x, y, z) (mean(x) + mean(y) + mean(z)) / 3,
.period = 10,
.align = "right"
)
# Or
avg_of_avgs <- slidify(
~(mean(..1) + mean(..2) + mean(..3)) / 3,
.period = 10,
.align = "right"
)
FB %>%
mutate(avg_of_avgs = avg_of_avgs(open, high, low))
roll_mean_na_rm <- slidify(~mean(.x, na.rm = TRUE), .period = 5, .align = "right")
FB %>%
mutate(roll_mean = roll_mean_na_rm(adjusted))
lm_roll <- slidify(~lm(.x ~ .y), .period = 90, .unlist = FALSE, .align = "right")
FB %>%
tidyr::drop_na() %>%
mutate(numeric_date = as.numeric(date)) %>%
mutate(rolling_lm = lm_roll(adjusted, numeric_date)) %>%
filter(!is.na(rolling_lm))
slidify_vec
Rolling Window Transformation
Description
slidify_vec() applies a summary function to a rolling sequence of windows.
Usage
slidify_vec(
.x,
.f,
...,
.period = 1,
.align = c("center", "left", "right"),
.partial = FALSE
)
Arguments
.x A vector to have a rolling window transformation applied.
.f A summary [function / formula]
• If a function, e.g. mean, the function is used with any additional arguments,
....
• If a formula, e.g. ~ mean(., na.rm = TRUE), it is converted to a function.
This syntax allows you to create very compact anonymous functions.
... Additional arguments passed on to the .f function.
.period The number of periods to include in the local rolling window. This is effectively
the "window size".
.align One of "center", "left" or "right".
.partial Should the moving window be allowed to return partial (incomplete) windows
instead of NA values. Set to FALSE by default, but can be switched to TRUE to
remove NA’s.
Details
The slidify_vec() function is a wrapper for slider::slide_vec() with simplified parameters and
"center", "left", "right" alignment.
Vector Length In == Vector Length Out
NA values or .partial values are always returned to ensure the length of the return vector is the
same as the length of the incoming vector. This ensures easier use with dplyr::mutate().
Alignment
Rolling functions generate .period - 1 fewer values than the incoming vector. Thus, the vector
needs to be aligned. Alignment of the vector follows 3 types:
• Center: NA or .partial values are divided and added to the beginning and end of the series
to "Center" the moving average. This is common for de-noising operations. See also
smooth_vec() for LOESS without NA values.
• Left: NA or .partial values are added to the end to shift the series to the Left.
• Right: NA or .partial values are added to the beginning to shift the series to the Right. This
is common in Financial Applications such as moving average cross-overs.
Partial Values
• The advantage to using .partial values vs NA padding is that the series can be filled (good
for time-series de-noising operations).
• The downside to partial values is that the partials can become less stable at the regions where
incomplete windows are used.
If instability is not desirable for de-noising operations, a suitable alternative is smooth_vec(),
which implements local polynomial regression.
Value
A numeric vector
References
• Slider R Package by Davis Vaughan
See Also
Modeling and More Complex Rolling Operations:
Examples
library(dplyr)
library(ggplot2)
# Training Data
FB_tbl <- FANG %>%
filter(symbol == "FB") %>%
select(symbol, date, adjusted)
FB_tbl %>%
mutate(
adjusted_loess_30 = smooth_vec(adjusted, period = 30, degree = 0),
adjusted_ma_30 = slidify_vec(adjusted, .f = mean,
.period = 30, .partial = TRUE)
) %>%
ggplot(aes(date, adjusted)) +
geom_line() +
geom_line(aes(y = adjusted_loess_30), color = "red") +
geom_line(aes(y = adjusted_ma_30), color = "blue") +
labs(title = "Loess vs Moving Average")
smooth_vec
Smoothing Transformation using Loess
Description
smooth_vec() applies a LOESS transformation to a numeric vector.
Usage
smooth_vec(x, period = 30, span = NULL, degree = 2)
Arguments
x A numeric vector to have a smoothing transformation applied.
period The number of periods to include in the local smoothing. Similar to window
size for a moving average. See details for an explanation of period vs span
specification.
span The span is a percentage of data to be included in the smoothing window. Period
is preferred for shorter windows to fix the window size. See details for an
explanation of period vs span specification.
degree The degree of the polynomials to be used. Acceptable values (least to most
flexible): 0, 1, 2. Set to 2 by default for 2nd order polynomial (most flexible).
Details
Benefits:
• When using period, the effect is similar to a moving average without creating missing
values.
• When using span, the effect is to detect the trend in a series using a percentage of the total
number of observations.
Loess Smoother Algorithm
This function is a simplified wrapper for stats::loess() with a modification to set a fixed period
rather than a percentage of data points via a span.
Why Period vs Span?
The period is fixed whereas the span changes as the number of observations changes.
When to use Period?
The effect of using a period is similar to a Moving Average where the Window Size is the Fixed
Period. This helps when you are trying to smooth local trends. If you want a 30-day moving
average, specify period = 30.
When to use Span?
Span is easier to specify when you want a Long-Term Trendline where the window size is unknown.
You can specify span = 0.75 to locally regress using a window of 75% of the data.
Value
A numeric vector
See Also
Loess Modeling Functions:
Examples
library(dplyr)
library(ggplot2)
# Training Data
FB_tbl <- FANG %>%
filter(symbol == "FB") %>%
select(symbol, date, adjusted)
FB_tbl %>%
mutate(adjusted_30 = smooth_vec(adjusted, period = 30, degree = 2)) %>%
ggplot(aes(date, adjusted)) +
geom_line() +
geom_line(aes(y = adjusted_30), color = "red")
FB_tbl %>%
mutate(adjusted_30 = smooth_vec(adjusted, span = 0.75, degree = 2)) %>%
ggplot(aes(date, adjusted)) +
geom_line() +
geom_line(aes(y = adjusted_30), color = "red")
FB_tbl %>%
mutate(
adjusted_loess_30 = smooth_vec(adjusted, period = 30, degree = 0),
adjusted_ma_30 = slidify_vec(adjusted, .period = 30,
.f = mean, .partial = TRUE)
) %>%
ggplot(aes(date, adjusted)) +
geom_line() +
geom_line(aes(y = adjusted_loess_30), color = "red") +
geom_line(aes(y = adjusted_ma_30), color = "blue") +
labs(title = "Loess vs Moving Average")
standardize_vec
Standardize to Mean 0, Standard Deviation 1
Description
Standardization is commonly used to center and scale numeric features to prevent one from
dominating in algorithms that require data to be on the same scale.
Usage
standardize_vec(x, mean = NULL, sd = NULL, silent = FALSE)
Arguments
x A numeric vector.
mean The mean used to invert the standardization
sd The standard deviation used to invert the standardization process.
silent Whether or not to report the automated mean and sd parameters as a message.
Details
Standardization vs Normalization
• Standardization refers to a transformation that rescales data to mean 0, standard deviation 1
• Normalization refers to a transformation that reduces the min-max range: (0, 1)
Value
Returns a numeric vector with the standardization transformation applied.
See Also
• Normalization/Standardization: standardize_vec(), normalize_vec()
• Box Cox Transformation: box_cox_vec()
• Lag Transformation: lag_vec()
• Differencing Transformation: diff_vec()
• Rolling Window Transformation: slidify_vec()
• Loess Smoothing Transformation: smooth_vec()
• Fourier Series: fourier_vec()
• Missing Value Imputation for Time Series: ts_impute_vec(), ts_clean_vec()
Examples
library(dplyr)
m4_daily %>%
group_by(id) %>%
mutate(value_std = standardize_vec(value))
step_box_cox
Box-Cox Transformation using Forecast Methods
Description
step_box_cox creates a specification of a recipe step that will transform data using a Box-Cox
transformation. This function differs from recipes::step_BoxCox by adding multiple methods
including Guerrero lambda optimization and handling for negative data used in the Forecast R
Package.
Usage
step_box_cox(
recipe,
...,
method = c("guerrero", "loglik"),
limits = c(-1, 2),
role = NA,
trained = FALSE,
lambdas_trained = NULL,
skip = FALSE,
id = rand_id("box_cox")
)
Arguments
recipe A recipe object. The step will be added to the sequence of operations for this
recipe.
... One or more selector functions to choose which variables are affected by the
step. See selections() for more details. For the tidy method, these are not
currently used.
method One of "guerrero" or "loglik"
limits A length 2 numeric vector defining the range to compute the transformation
parameter lambda.
role Not used by this step since no new variables are created.
trained A logical to indicate if the quantities for preprocessing have been estimated.
lambdas_trained
A numeric vector of transformation values. This is NULL until computed by
prep().
skip A logical. Should the step be skipped when the recipe is baked by bake.recipe()?
While all operations are baked when prep.recipe() is run, some operations
may not be able to be conducted on new data (e.g. processing the outcome
variable(s)). Care should be taken when using skip = TRUE as it may affect the
computations for subsequent operations.
id A character string that is unique to this step to identify it.
x A step_box_cox object.
Details
The step_box_cox() function is designed specifically to handle time series using methods imple-
mented in the Forecast R Package.
Negative Data
This function can be applied to Negative Data.
Lambda Optimization Methods
This function uses 2 methods for optimizing the lambda selection from the Forecast R Package:
1. method = "guerrero": Guerrero's (1993) method is used, where lambda minimizes the
coefficient of variation for subseries of x.
2. method = "loglik": the value of lambda is chosen to maximize the profile log likelihood of a
linear model fitted to x. For non-seasonal data, a linear time trend is fitted while for seasonal
data, a linear time trend with seasonal dummy variables is used.
Value
An updated version of recipe with the new step added to the sequence of existing steps (if any).
For the tidy method, a tibble with columns terms (the selectors or variables selected) and value
(the lambda estimate).
References
1. Guerrero, V.M. (1993) Time-series analysis supported by power transformations. Journal of
Forecasting, 12, 37–48.
2. Box, G. E. P. and Cox, D. R. (1964) An analysis of transformations. JRSS B 26 211–246.
See Also
Time Series Analysis:
• recipes::recipe()
• recipes::prep()
• recipes::bake()
Examples
library(dplyr)
library(recipes)
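A minimal sketch, assuming the FANG dataset that ships with timetk:

```r
# Train a recipe that Box-Cox transforms two numeric columns
recipe_box_cox <- recipe(~ ., data = FANG) %>%
  step_box_cox(adjusted, volume) %>%
  prep()

# Apply the transformation
recipe_box_cox %>% bake(new_data = FANG)

# The tidy method reports the trained lambda estimates
tidy(recipe_box_cox, number = 1)
```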
step_diff
Create a differenced predictor
Description
step_diff creates a specification of a recipe step that will add new columns of differenced data.
Differenced data will include NA values where a difference was induced. These can be removed
with step_naomit().
Usage
step_diff(
recipe,
...,
role = "predictor",
trained = FALSE,
lag = 1,
difference = 1,
log = FALSE,
prefix = "diff_",
columns = NULL,
skip = FALSE,
id = rand_id("diff")
)
Arguments
recipe A recipe object. The step will be added to the sequence of operations for this
recipe.
... One or more selector functions to choose which variables are affected by the
step. See selections() for more details.
role Defaults to "predictor"
trained A logical to indicate if the quantities for preprocessing have been estimated.
lag A vector of positive integers identifying which lags (how far back) to include
in the differencing calculation.
difference The number of differences to perform.
log Calculates log differences instead of differences.
prefix A prefix for generated column names, default to "diff_".
columns A character string of variable names that will be populated (eventually) by the
terms argument.
skip A logical. Should the step be skipped when the recipe is baked by bake.recipe()?
While all operations are baked when prep.recipe() is run, some operations
may not be able to be conducted on new data (e.g. processing the outcome
variable(s)). Care should be taken when using skip = TRUE as it may affect the
computations for subsequent operations
id A character string that is unique to this step to identify it.
x A step_diff object.
Details
The step assumes that the data are already in the proper sequential order for lagging.
Value
An updated version of recipe with the new step added to the sequence of existing steps (if any).
See Also
Time Series Analysis:
Remove NA Values:
• recipes::step_naomit()
• recipes::recipe()
• recipes::prep()
• recipes::bake()
Examples
library(recipes)
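A minimal sketch, assuming the FANG dataset from timetk (the pipe is re-exported by recipes):

```r
# Add lag-1 differences of two numeric columns; the induced NA rows
# can be removed afterwards with step_naomit()
recipe(~ ., data = FANG) %>%
  step_diff(adjusted, volume, lag = 1, difference = 1) %>%
  prep() %>%
  bake(new_data = NULL)   # returns the prepped training data
```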
step_fourier
Fourier Features for Modeling Seasonality
Description
step_fourier creates a specification of a recipe step that will convert a Date or Date-time column
into a Fourier series.
Usage
step_fourier(
recipe,
...,
period,
K,
role = "predictor",
trained = FALSE,
columns = NULL,
scale_factor = NULL,
skip = FALSE,
id = rand_id("fourier")
)
Arguments
recipe A recipe object. The step will be added to the sequence of operations for this
recipe.
... A single column with class Date or POSIXct. See recipes::selections() for
more details. For the tidy method, these are not currently used.
period The numeric period for the oscillation frequency. See details for examples of
period specification.
K The number of orders to include for each sine/cosine fourier series. More orders
increase the number of fourier terms and therefore the variance of the fitted
model at the expense of bias. See details for examples of K specification.
role For model terms created by this step, what analysis role should they be assigned?
By default, the function assumes that the new variable columns created by the
original variables will be used as predictors in a model.
trained A logical to indicate if the quantities for preprocessing have been estimated.
columns A character string of variables that will be used as inputs. This field is a place-
holder and will be populated once recipes::prep() is used.
scale_factor A factor for scaling the numeric index extracted from the date or date-time fea-
ture. This is a placeholder and will be populated once recipes::prep() is
used.
skip A logical. Should the step be skipped when the recipe is baked by bake.recipe()?
While all operations are baked when prep.recipe() is run, some operations may
not be able to be conducted on new data (e.g. processing the outcome vari-
able(s)). Care should be taken when using skip = TRUE as it may affect the
computations for subsequent operations.
id A character string that is unique to this step to identify it.
x A step_fourier object.
Details
Date Variable
Unlike other steps, step_fourier does not remove the original date variables. recipes::step_rm()
can be used for this purpose.
Period Specification
The period argument is used to generate the distance between peaks in the fourier sequence. The
key is to line up the peaks with unique seasonalities in the data.
For Daily Data, typical period specifications are:
• Yearly frequency is 365
• Quarterly frequency is 365 / 4 = 91.25
• Monthly frequency is 365 / 12 = 30.42
K Specification
The K argument specifies the maximum number of orders of Fourier terms. Examples:
• Specifying period = 365 and K = 1 will return a cos365_K1 and sin365_K1 fourier series
• Specifying period = 365 and K = 2 will return a cos365_K1, cos365_K2, sin365_K1 and
sin365_K2 sequence, which tends to increase the model's ability to fit vs the K = 1
specification (at the expense of possibly overfitting).
Multiple values of period and K
It's possible to specify multiple values of period in a single step such as step_fourier(period = c(91.25, 365), K = 2).
This returns 8 Fourier series:
• cos91.25_K1, sin91.25_K1, cos91.25_K2, sin91.25_K2
• cos365_K1, sin365_K1, cos365_K2, sin365_K2
Value
For step_fourier, an updated version of recipe with the new step added to the sequence of ex-
isting steps (if any). For the tidy method, a tibble with columns terms (the selectors or variables
selected), value (the feature names).
See Also
Time Series Analysis:
• recipes::recipe()
• recipes::prep()
• recipes::bake()
Examples
library(recipes)
library(dplyr)
# A sketch: build a recipe on the m4_daily dataset (assumed) and prep it
rec_obj <- recipe(value ~ date, data = m4_daily) %>%
step_fourier(date, period = c(91.25, 365), K = 1) %>%
prep()
# Tidy shows which features have been added during the 1st step
tidy(rec_obj)
tidy(rec_obj, number = 1)
step_holiday_signature
Holiday Feature (Signature) Generator
Description
step_holiday_signature creates a specification of a recipe step that will convert date or date-
time data into many holiday features that can aid in machine learning with time-series data. By
default, many features are returned for different holidays, locales, and stock exchanges.
Usage
step_holiday_signature(
recipe,
...,
holiday_pattern = ".",
locale_set = "all",
exchange_set = "all",
role = "predictor",
trained = FALSE,
columns = NULL,
features = NULL,
skip = FALSE,
id = rand_id("holiday_signature")
)
Arguments
recipe A recipe object. The step will be added to the sequence of operations for this
recipe.
... One or more selector functions to choose which variables that will be used to cre-
ate the new variables. The selected variables should have class Date or POSIXct.
See recipes::selections() for more details. For the tidy method, these are
not currently used.
holiday_pattern
A regular expression pattern to search the "Holiday Set".
locale_set Return binary holidays based on locale. One of: "all", "none", "World", "US",
"CA", "GB", "FR", "IT", "JP", "CH", "DE".
exchange_set Return binary holidays based on Stock Exchange Calendars. One of: "all",
"none", "NYSE", "LONDON", "NERC", "TSX", "ZURICH".
role For model terms created by this step, what analysis role should they be assigned?
By default, the function assumes that the new variable columns created by the
original variables will be used as predictors in a model.
trained A logical to indicate if the quantities for preprocessing have been estimated.
columns A character string of variables that will be used as inputs. This field is a place-
holder and will be populated once recipes::prep() is used.
features A character string of features that will be generated. This field is a placeholder
and will be populated once recipes::prep() is used.
skip A logical. Should the step be skipped when the recipe is baked by bake.recipe()?
While all operations are baked when prep.recipe() is run, some operations may
not be able to be conducted on new data (e.g. processing the outcome vari-
able(s)). Care should be taken when using skip = TRUE as it may affect the
computations for subsequent operations.
id A character string that is unique to this step to identify it.
x A step_holiday_signature object.
Details
Use Holiday Pattern and Feature Sets to Pare Down Features By default, you’re going to get A
LOT of Features. This is a good thing because many machine learning algorithms have regulariza-
tion built in. But, in many cases you will still want to reduce the number of unnecessary features.
Here’s how:
• Holiday Pattern: This is a Regular Expression pattern that can be used to filter. Try holiday_pattern
= "(US_Christ)|(US_Thanks)" to return just Christmas and Thanksgiving features.
• Locale Sets: This is a logical as to whether or not the locale has a holiday. For locales outside
of US you may want to combine multiple locales. For example, locale_set = c("World",
"GB") returns both World Holidays and Great Britain.
• Exchange Sets: This is a logical as to whether or not the Business is off due to a holiday.
Different Stock Exchanges are used as a proxy for business holiday calendars. For example,
exchange_set = "NYSE" returns business holidays for New York Stock Exchange.
Removing Unnecessary Features By default, many features are created automatically. Unneces-
sary features can be removed using recipes::step_rm(); see recipes::selections() for details
on selecting columns to remove.
Value
For step_holiday_signature, an updated version of recipe with the new step added to the se-
quence of existing steps (if any). For the tidy method, a tibble with columns terms (the selectors
or variables selected), value (the feature names).
See Also
Time Series Analysis:
• Engineered Features: step_timeseries_signature(), step_holiday_signature(), step_fourier()
• Diffs & Lags: step_diff(), recipes::step_lag()
• Smoothing: step_slidify(), step_smooth()
• Variance Reduction: step_box_cox()
• Imputation: step_ts_impute(), step_ts_clean()
• Padding: step_ts_pad()
Main Recipe Functions:
• recipes::recipe()
• recipes::prep()
• recipes::bake()
Examples
library(recipes)
library(dplyr)
# Sample Data
dates_in_2017_tbl <- tibble::tibble(
index = tk_make_timeseries("2017-01-01", "2017-12-31", by = "day")
)
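The sample data above can then be fed through the step. A minimal sketch, assuming the pattern and set values below (they are illustrative choices, not the only sensible ones):

```r
library(recipes)
library(timetk)

# All days in 2017
dates_in_2017_tbl <- tibble::tibble(
  index = tk_make_timeseries("2017-01-01", "2017-12-31", by = "day")
)

# Pare the feature set down to US Christmas & Thanksgiving flags
rec_holiday <- recipe(~ index, data = dates_in_2017_tbl) %>%
  step_holiday_signature(
    index,
    holiday_pattern = "(US_Christ)|(US_Thanks)",
    locale_set      = "US",
    exchange_set    = "none"
  ) %>%
  prep()

bake(rec_holiday, new_data = NULL)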
step_log_interval
Description
step_log_interval creates a specification of a recipe step that will transform data using a Log-
Interval transformation. This function provides a recipes interface for the log_interval_vec()
transformation function.
Usage
step_log_interval(
recipe,
...,
limit_lower = "auto",
limit_upper = "auto",
offset = 0,
role = NA,
trained = FALSE,
limit_lower_trained = NULL,
limit_upper_trained = NULL,
skip = FALSE,
id = rand_id("log_interval")
)
Arguments
recipe A recipe object. The step will be added to the sequence of operations for this
recipe.
... One or more selector functions to choose which variables are affected by the
step. See selections() for more details. For the tidy method, these are not
currently used.
limit_lower A lower limit. Must be less than the minimum value. If set to "auto", selects
zero.
limit_upper An upper limit. Must be greater than the maximum value. If set to "auto", selects
a value that is 10% greater than the maximum value.
offset An offset to include in the log transformation. Useful when the data contains
values less than or equal to zero.
role Not used by this step since no new variables are created.
trained A logical to indicate if the quantities for preprocessing have been estimated.
limit_lower_trained
A numeric vector of transformation values. This is NULL until computed by
prep().
limit_upper_trained
A numeric vector of transformation values. This is NULL until computed by
prep().
skip A logical. Should the step be skipped when the recipe is baked by bake.recipe()?
While all operations are baked when prep.recipe() is run, some operations
may not be able to be conducted on new data (e.g. processing the outcome
variable(s)). Care should be taken when using skip = TRUE as it may affect the
computations for subsequent operations.
id A character string that is unique to this step to identify it.
x A step_log_interval object.
Details
The step_log_interval() function is designed specifically to handle time series using methods
implemented in the Forecast R Package.
Positive Data
If data includes values of zero, use offset to adjust the series to make the values positive.
Implementation
Refer to the log_interval_vec() function for the transformation implementation details.
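As a rough illustration of the transformation itself, a sketch of the vector function (the limits below are made up for the example):

```r
library(timetk)

# Values known to lie between 0 and 100
values <- c(10, 25, 50, 75, 90)

# Map the bounded interval onto the real line
transformed <- log_interval_vec(values, limit_lower = 0, limit_upper = 100)

# Invert the transformation to recover the original scale
log_interval_inv_vec(transformed, limit_lower = 0, limit_upper = 100)
```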
Value
An updated version of recipe with the new step added to the sequence of existing steps (if any).
For the tidy method, a tibble with columns terms (the selectors or variables selected) and value
(the lambda estimate).
See Also
Time Series Analysis:
• Engineered Features: step_timeseries_signature(), step_holiday_signature(), step_fourier()
• Diffs & Lags: step_diff(), recipes::step_lag()
• Smoothing: step_slidify(), step_smooth()
• Variance Reduction: step_log_interval()
• Imputation: step_ts_impute(), step_ts_clean()
• Padding: step_ts_pad()
Transformations to reduce variance:
• recipes::step_log() - Log transformation
• recipes::step_sqrt() - Square-Root Power Transformation
Recipe Setup and Application:
• recipes::recipe()
• recipes::prep()
• recipes::bake()
Examples
library(dplyr)
library(recipes)
# Assumed setup: FANG_wide is a wide tibble of FANG adjusted prices
recipe_log_interval <- recipe(~ ., data = FANG_wide) %>%
  step_log_interval(FB, AMZN, NFLX, GOOG) %>%
  prep()
recipe_log_interval %>%
  bake(FANG_wide) %>%
  tidyr::pivot_longer(-date) %>%
  plot_time_series(date, value, name, .smooth = FALSE, .interactive = FALSE)
step_slidify
Description
step_slidify creates a specification of a recipe step that will apply a function to one or more
numeric columns.
Usage
step_slidify(
recipe,
...,
period,
.f,
align = c("center", "left", "right"),
partial = FALSE,
names = NULL,
role = "predictor",
trained = FALSE,
columns = NULL,
f_name = NULL,
skip = FALSE,
id = rand_id("slidify")
)
Arguments
recipe A recipe object. The step will be added to the sequence of operations for this
recipe.
... One or more numeric columns to be smoothed. See recipes::selections()
for more details. For the tidy method, these are not currently used.
period The number of periods to include in the local rolling window. This is effectively
the "window size".
.f A summary formula in one of the following formats:
• mean with no arguments
• function(x) mean(x, na.rm = TRUE)
• ~ mean(.x, na.rm = TRUE), it is converted to a function.
align Rolling functions generate period - 1 fewer values than the incoming vector.
Thus, the vector needs to be aligned. Alignment of the vector follows 3 types:
• Center: NA or .partial values are divided and added to the beginning
and end of the series to "Center" the moving average. This is common for
de-noising operations. See also [smooth_vec()] for LOESS without NA
values.
• Left: NA or .partial values are added to the end to shift the series to the
Left.
• Right: NA or .partial values are added to the beginning to shift the series
to the Right. This is common in Financial Applications such as moving
average cross-overs.
partial Should the moving window be allowed to return partial (incomplete) windows
instead of NA values? Set to FALSE by default, but can be switched to TRUE to
remove NAs.
names An optional character string that is the same length of the number of terms se-
lected by terms. These will be the names of the new columns created by the
step.
• If NULL, existing columns are transformed.
• If not NULL, new columns will be created.
role For model terms created by this step, what analysis role should they be as-
signed? By default, the function assumes that the new variable columns created
by the original variables will be used as predictors in a model.
trained A logical to indicate if the quantities for preprocessing have been estimated.
columns A character string of variables that will be used as inputs. This field is a place-
holder and will be populated once recipes::prep() is used.
f_name A character string for the function being applied. This field is a placeholder and
will be populated during the tidy() step.
skip A logical. Should the step be skipped when the recipe is baked by bake.recipe()?
While all operations are baked when prep.recipe() is run, some operations may
not be able to be conducted on new data (e.g. processing the outcome vari-
able(s)). Care should be taken when using skip = TRUE as it may affect the
computations for subsequent operations.
id A character string that is unique to this step to identify it.
x A step_slidify object.
Details
Alignment
Rolling functions generate period - 1 fewer values than the incoming vector. Thus, the vector
needs to be aligned. Alignment of the vector follows 3 types:
• Center: NA or partial values are divided and added to the beginning and end of the se-
ries to "Center" the moving average. This is common for de-noising operations. See also
[smooth_vec()] for LOESS without NA values.
• Left: NA or partial values are added to the end to shift the series to the Left.
• Right: NA or partial values are added to the beginning to shift the series to the Right. This is
common in Financial Applications such as moving average cross-overs.
Partial Values
• The advantage to using partial values vs NA padding is that the series can be filled (good for
time-series de-noising operations).
• The downside to partial values is that the partials can become less stable at the regions where
incomplete windows are used.
Value
For step_slidify, an updated version of recipe with the new step added to the sequence of ex-
isting steps (if any). For the tidy method, a tibble with columns terms (the selectors or variables
selected), value (the feature names).
See Also
Time Series Analysis:
• recipes::recipe()
• recipes::prep()
• recipes::bake()
Examples
library(recipes)
library(dplyr)
library(ggplot2)
# Training Data
FB_tbl <- FANG %>%
filter(symbol == "FB") %>%
select(symbol, date, adjusted)
# New Data - Make some fake new data next 90 time stamps
new_data <- FB_tbl %>%
tail(90) %>%
mutate(date = date %>% tk_make_future_timeseries(length_out = 90))
# Assumed setup: a slidify recipe prepped on the training data
rec_obj <- recipe(adjusted ~ ., data = FB_tbl) %>%
  step_slidify(adjusted, period = 30, .f = ~ mean(.x))
training_data_baked <- bake(prep(rec_obj), FB_tbl)
new_data_baked <- bake(prep(rec_obj), new_data)
# Visualize effect
training_data_baked %>%
  ggplot(aes(date, adjusted)) +
  geom_line() +
  geom_line(color = "red", data = new_data_baked)
step_slidify_augment
Description
step_slidify_augment creates a specification of a recipe step that will "augment" your data,
adding multiple new columns that have had a sliding function applied.
Usage
step_slidify_augment(
recipe,
...,
period,
.f,
align = c("center", "left", "right"),
partial = FALSE,
prefix = "slidify_",
role = "predictor",
trained = FALSE,
columns = NULL,
f_name = NULL,
skip = FALSE,
id = rand_id("slidify_augment")
)
Arguments
recipe A recipe object. The step will be added to the sequence of operations for this
recipe.
... One or more numeric columns to be smoothed. See recipes::selections()
for more details. For the tidy method, these are not currently used.
period The number of periods to include in the local rolling window. This is effectively
the "window size".
.f A summary formula in one of the following formats:
• mean with no arguments
• function(x) mean(x, na.rm = TRUE)
• ~ mean(.x, na.rm = TRUE), it is converted to a function.
align Rolling functions generate period - 1 fewer values than the incoming vector.
Thus, the vector needs to be aligned. Alignment of the vector follows 3 types:
• Center: NA or .partial values are divided and added to the beginning
and end of the series to "Center" the moving average. This is common for
de-noising operations. See also [smooth_vec()] for LOESS without NA
values.
• Left: NA or .partial values are added to the end to shift the series to the
Left.
• Right: NA or .partial values are added to the beginning to shift the series
to the Right. This is common in Financial Applications such as moving
average cross-overs.
partial Should the moving window be allowed to return partial (incomplete) windows
instead of NA values? Set to FALSE by default, but can be switched to TRUE to
remove NAs.
prefix A prefix for generated column names, default to "slidify_".
role For model terms created by this step, what analysis role should they be as-
signed? By default, the function assumes that the new variable columns created
by the original variables will be used as predictors in a model.
trained A logical to indicate if the quantities for preprocessing have been estimated.
columns A character string of variable names that will be populated (eventually) by the
terms argument.
f_name A character string for the function being applied. This field is a placeholder and
will be populated during the tidy() step.
skip A logical. Should the step be skipped when the recipe is baked by bake.recipe()?
While all operations are baked when prep.recipe() is run, some operations
may not be able to be conducted on new data (e.g. processing the outcome
variable(s)). Care should be taken when using skip = TRUE as it may affect the
computations for subsequent operations.
id A character string that is unique to this step to identify it.
x A step_slidify_augment object.
Details
Alignment
Rolling functions generate period - 1 fewer values than the incoming vector. Thus, the vector
needs to be aligned. Alignment of the vector follows 3 types:
• Center: NA or partial values are divided and added to the beginning and end of the se-
ries to "Center" the moving average. This is common for de-noising operations. See also
[smooth_vec()] for LOESS without NA values.
• Left: NA or partial values are added to the end to shift the series to the Left.
• Right: NA or partial values are added to the beginning to shift the series to the Right. This is
common in Financial Applications such as moving average cross-overs.
Partial Values
• The advantage to using partial values vs NA padding is that the series can be filled (good for
time-series de-noising operations).
• The downside to partial values is that the partials can become less stable at the regions where
incomplete windows are used.
Value
For step_slidify_augment, an updated version of recipe with the new step added to the sequence
of existing steps (if any). For the tidy method, a tibble with columns terms (the selectors or
variables selected), value (the feature names).
See Also
Time Series Analysis:
• Engineered Features: step_timeseries_signature(), step_holiday_signature(), step_fourier()
• Diffs & Lags: step_diff(), recipes::step_lag()
• Smoothing: step_slidify(), step_smooth()
• Variance Reduction: step_box_cox()
• Imputation: step_ts_impute(), step_ts_clean()
• Padding: step_ts_pad()
Main Recipe Functions:
• recipes::recipe()
• recipes::prep()
• recipes::bake()
Examples
# library(tidymodels)
library(dplyr)
library(recipes)
library(parsnip)
# Make a recipe
recipe_spec <- recipe(value ~ date + value_2, rsample::training(m750_splits)) %>%
step_slidify_augment(
value, value_2,
period = c(6, 12, 24),
.f = ~ mean(.x),
align = "center",
partial = FALSE
)
bake(prep(recipe_spec), rsample::testing(m750_splits))
step_smooth
Description
step_smooth creates a specification of a recipe step that will apply local polynomial regression
to one or more numeric columns. The effect is smoothing the time series similar to a moving
average without creating missing values or using partial smoothing.
Usage
step_smooth(
recipe,
...,
period = 30,
span = NULL,
degree = 2,
names = NULL,
role = "predictor",
trained = FALSE,
columns = NULL,
skip = FALSE,
id = rand_id("smooth")
)
Arguments
recipe A recipe object. The step will be added to the sequence of operations for this
recipe.
... One or more numeric columns to be smoothed. See recipes::selections()
for more details. For the tidy method, these are not currently used.
period The number of periods to include in the local smoothing. Similar to window
size for a moving average. See Details for an explanation of period vs span
specification.
span The span is a percentage of data to be included in the smoothing window. Pe-
riod is preferred for shorter windows to fix the window size. See Details for an
explanation of period vs span specification.
degree The degree of the polynomials to be used. Set to 2 by default for 2nd order
polynomial.
names An optional character string that is the same length of the number of terms se-
lected by terms. These will be the names of the new columns created by the
step.
Details
Smoother Algorithm This function is a recipe specification that wraps stats::loess() with a
modification to set a fixed period rather than a percentage of data points via a span.
Why Period vs Span? The period is fixed whereas the span changes as the number of observations
change.
When to use Period? The effect of using a period is similar to a Moving Average where the
Window Size is the Fixed Period. This helps when you are trying to smooth local trends. If you
want a 30-day moving average, specify period = 30.
When to use Span? Span is easier to specify when you want a Long-Term Trendline where the
window size is unknown. You can specify span = 0.75 to locally regress using a window of 75%
of the data.
Warning - Using Span with New Data When using span on New Data, the number of observations
is likely different than what you trained with. This means the trendline / smoother can be vastly
different than the smoother you trained with.
Solution to Span with New Data Don’t use span. Rather, use period to fix the window size. This
ensures that new data includes the same number of observations in the local polynomial regression
(loess) as the training data.
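The two specifications can be sketched side by side. In this sketch FB_tbl stands for a data frame with date and adjusted columns (as in the example below); the window values are illustrative:

```r
library(recipes)
library(timetk)

# Fixed window: a 30-observation local regression; stable on new data
rec_period <- recipe(adjusted ~ ., data = FB_tbl) %>%
  step_smooth(adjusted, period = 30)

# Percentage window: 75% of the data; good for long-term trendlines,
# but the effective window size shifts when the number of observations changes
rec_span <- recipe(adjusted ~ ., data = FB_tbl) %>%
  step_smooth(adjusted, span = 0.75)
```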
Value
For step_smooth, an updated version of recipe with the new step added to the sequence of exist-
ing steps (if any). For the tidy method, a tibble with columns terms (the selectors or variables
selected), value (the feature names).
See Also
Time Series Analysis:
Examples
library(recipes)
library(dplyr)
library(ggplot2)
# Training Data
FB_tbl <- FANG %>%
filter(symbol == "FB") %>%
select(symbol, date, adjusted)
# New Data - Make some fake new data next 90 time stamps
new_data <- FB_tbl %>%
tail(90) %>%
mutate(date = date %>% tk_make_future_timeseries(length_out = 90))
# Smoother's fit is not the same using span because new data is only 90 days
# and 0.03 x 90 = 2.7 days
rec_obj <- recipe(adjusted ~ ., data = FB_tbl) %>%
  step_smooth(adjusted, span = 0.03)  # assumed smoother spec
training_data_baked <- bake(prep(rec_obj), FB_tbl)
new_data_baked <- bake(prep(rec_obj), new_data)
training_data_baked %>%
  ggplot(aes(date, adjusted)) +
  geom_line() +
  geom_line(color = "red", data = new_data_baked)
step_timeseries_signature
Time Series Feature (Signature) Generator
Description
step_timeseries_signature creates a specification of a recipe step that will convert date or
date-time data into many features that can aid in machine learning with time-series data.
Usage
step_timeseries_signature(
recipe,
...,
role = "predictor",
trained = FALSE,
columns = NULL,
skip = FALSE,
id = rand_id("timeseries_signature")
)
Arguments
recipe A recipe object. The step will be added to the sequence of operations for this
recipe.
... One or more selector functions to choose which variables will be used to create
the new variables. The selected variables should have class Date or POSIXct.
See recipes::selections() for more details. For the tidy method, these are
not currently used.
role For model terms created by this step, what analysis role should they be as-
signed? By default, the function assumes that the new variable columns created
by the original variables will be used as predictors in a model.
trained A logical to indicate if the quantities for preprocessing have been estimated.
columns A character string of variables that will be used as inputs. This field is a place-
holder and will be populated once recipes::prep() is used.
skip A logical. Should the step be skipped when the recipe is baked by bake.recipe()?
While all operations are baked when prep.recipe() is run, some operations may
not be able to be conducted on new data (e.g. processing the outcome vari-
able(s)). Care should be taken when using skip = TRUE as it may affect the
computations for subsequent operations.
id A character string that is unique to this step to identify it.
x A step_timeseries_signature object.
Details
Date Variable Unlike other steps, step_timeseries_signature does not remove the original date
variables. recipes::step_rm() can be used for this purpose.
Scaling index.num The index.num feature created has a large magnitude (number of seconds since
1970-01-01). It’s a good idea to scale and center this feature (e.g. use recipes::step_normalize()).
Removing Unnecessary Features By default, many features are created automatically. Unneces-
sary features can be removed using recipes::step_rm().
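Putting the advice above together, a sketch. The generated column name date_index.num is an assumption based on timetk's convention of prefixing features with the source column name:

```r
library(recipes)
library(timetk)
library(dplyr)

FB_tbl <- FANG %>%
  filter(symbol == "FB") %>%
  select(date, adjusted)

rec_obj <- recipe(adjusted ~ ., data = FB_tbl) %>%
  step_timeseries_signature(date) %>%
  step_normalize(date_index.num) %>%  # scale & center the large-magnitude index
  step_rm(date)                       # the original date column is otherwise kept

bake(prep(rec_obj), FB_tbl)
```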
Value
For step_timeseries_signature, an updated version of recipe with the new step added to the
sequence of existing steps (if any). For the tidy method, a tibble with columns terms (the selectors
or variables selected), value (the feature names).
See Also
• recipes::recipe()
• recipes::prep()
• recipes::bake()
Examples
library(recipes)
library(dplyr)
FB_tbl <- FANG %>% filter(symbol == "FB") %>% select(date, adjusted)  # assumed setup
rec_obj <- recipe(adjusted ~ ., data = FB_tbl) %>%
  step_timeseries_signature(date) %>%
  prep()
# Tidy shows which features have been added during the 1st step
# in this case, step 1 is the step_timeseries_signature step
tidy(rec_obj)
tidy(rec_obj, number = 1)
step_ts_clean
Description
step_ts_clean creates a specification of a recipe step that will clean outliers and impute time
series data.
Usage
step_ts_clean(
recipe,
...,
period = 1,
lambda = "auto",
role = NA,
trained = FALSE,
lambdas_trained = NULL,
skip = FALSE,
id = rand_id("ts_clean")
)
Arguments
recipe A recipe object. The step will be added to the sequence of operations for this
recipe.
... One or more selector functions to choose which variables are affected by the
step. See selections() for more details. For the tidy method, these are not
currently used.
period A seasonal period to use during the transformation. If period = 1, linear in-
terpolation is performed. If period > 1, a robust STL decomposition is first
performed and a linear interpolation is applied to the seasonally adjusted data.
lambda A box cox transformation parameter. If set to "auto", performs automated
lambda selection.
role Not used by this step since no new variables are created.
trained A logical to indicate if the quantities for preprocessing have been estimated.
lambdas_trained
A named numeric vector of lambdas. This is NULL until computed by recipes::prep().
Note that, if the original data are integers, the mean will be converted to an inte-
ger to maintain the same data type.
skip A logical. Should the step be skipped when the recipe is baked by bake.recipe()?
While all operations are baked when prep.recipe() is run, some operations
may not be able to be conducted on new data (e.g. processing the outcome
variable(s)). Care should be taken when using skip = TRUE as it may affect the
computations for subsequent operations.
id A character string that is unique to this step to identify it.
x A step_ts_clean object.
Details
The step_ts_clean() function is designed specifically to handle time series using seasonal outlier
detection methods implemented in the Forecast R Package.
Cleaning Outliers
Outliers are replaced with missing values using the following methods:
1. Period is 1: With period = 1, a seasonality cannot be interpreted, and therefore linear interpolation is used.
2. Number of Non-Missing Values is less than 2-Periods: Insufficient values exist to detect
seasonality.
3. Number of Total Values is less than 3-Periods: Insufficient values exist to detect seasonality.
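A quick vector-level illustration of the cleaning behavior (the values are made up; ts_clean_vec() is the vector function this step is built on):

```r
library(timetk)

# A series with one spike and one missing value
values <- c(1, 2, 3, 50, 5, NA, 7, 8)

# period = 1: the outlier is replaced and the gap filled by linear interpolation
ts_clean_vec(values, period = 1)
```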
Value
An updated version of recipe with the new step added to the sequence of existing steps (if any).
For the tidy method, a tibble with columns terms (the selectors or variables selected) and value
(the lambda estimate).
References
• Forecast R Package
• Forecasting Principles & Practices: Dealing with missing values and outliers
See Also
Examples
library(dplyr)
library(tidyr)
library(recipes)
# FANG_wide (assumed setup): FANG adjusted prices pivoted to wide format
FANG_wide <- FANG %>%
  select(symbol, date, adjusted) %>%
  tidyr::pivot_wider(names_from = symbol, values_from = adjusted)
FANG_wide
# Apply Cleaning
recipe_ts_clean <- recipe(~ ., data = FANG_wide) %>%
  step_ts_clean(FB, AMZN, NFLX, GOOG, period = 252) %>%
  prep()
step_ts_impute
Description
step_ts_impute creates a specification of a recipe step that will impute time series data.
Usage
step_ts_impute(
recipe,
...,
period = 1,
lambda = NULL,
role = NA,
trained = FALSE,
lambdas_trained = NULL,
skip = FALSE,
id = rand_id("ts_impute")
)
Arguments
recipe A recipe object. The step will be added to the sequence of operations for this
recipe.
... One or more selector functions to choose which variables are affected by the
step. See selections() for more details. For the tidy method, these are not
currently used.
period A seasonal period to use during the transformation. If period = 1, linear in-
terpolation is performed. If period > 1, a robust STL decomposition is first
performed and a linear interpolation is applied to the seasonally adjusted data.
lambda A box cox transformation parameter. If set to "auto", performs automated
lambda selection.
role Not used by this step since no new variables are created.
trained A logical to indicate if the quantities for preprocessing have been estimated.
lambdas_trained
A named numeric vector of lambdas. This is NULL until computed by recipes::prep().
Note that, if the original data are integers, the mean will be converted to an inte-
ger to maintain the same data type.
skip A logical. Should the step be skipped when the recipe is baked by bake.recipe()?
While all operations are baked when prep.recipe() is run, some operations
may not be able to be conducted on new data (e.g. processing the outcome
variable(s)). Care should be taken when using skip = TRUE as it may affect the
computations for subsequent operations.
id A character string that is unique to this step to identify it.
x A step_ts_impute object.
Details
The step_ts_impute() function is designed specifically to handle time series. Imputation is
performed using the following methods:
1. Period is 1: With period = 1, a seasonality cannot be interpreted, and therefore linear interpolation is used.
2. Number of Non-Missing Values is less than 2-Periods: Insufficient values exist to detect
seasonality.
3. Number of Total Values is less than 3-Periods: Insufficient values exist to detect seasonality.
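At the vector level the same logic can be sketched directly (ts_impute_vec() is the vector function this step provides a recipes interface for):

```r
library(timetk)

values <- c(1, 2, NA, 4, 5, NA, 7)

# period = 1: missing values filled by linear interpolation
ts_impute_vec(values, period = 1)
```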
Value
An updated version of recipe with the new step added to the sequence of existing steps (if any).
For the tidy method, a tibble with columns terms (the selectors or variables selected) and value
(the lambda estimate).
References
• Forecast R Package
• Forecasting Principles & Practices: Dealing with missing values and outliers
See Also
Time Series Analysis:
• recipes::recipe()
• recipes::prep()
• recipes::bake()
Examples
library(dplyr)
library(recipes)
# FANG_wide (assumed setup): FANG adjusted prices pivoted to wide format
FANG_wide <- FANG %>%
  select(symbol, date, adjusted) %>%
  tidyr::pivot_wider(names_from = symbol, values_from = adjusted)
FANG_wide
# Apply Imputation
recipe_ts_impute <- recipe(~ ., data = FANG_wide) %>%
  step_ts_impute(FB, AMZN, NFLX, GOOG, period = 252, lambda = "auto") %>%
  prep()
step_ts_pad Pad: Add rows to fill gaps and go from low to high frequency
Description
step_ts_pad creates a specification of a recipe step that will analyze a Date or Date-time column,
adding rows at a specified interval.
Usage
step_ts_pad(
recipe,
...,
by = "day",
pad_value = NA,
role = "predictor",
trained = FALSE,
columns = NULL,
skip = FALSE,
id = rand_id("ts_padding")
)
Arguments
recipe A recipe object. The step will be added to the sequence of operations for this
recipe.
... A single column with class Date or POSIXct. See recipes::selections() for
more details. For the tidy method, these are not currently used.
by Either "auto", a time-based frequency like "year", "month", "day", "hour", etc,
or a time expression like "5 min", or "7 days". See Details.
pad_value Fills in padded values. Default is NA.
role For model terms created by this step, what analysis role should they be as-
signed? By default, the function assumes that the new variable columns created
by the original variables will be used as predictors in a model.
trained A logical to indicate if the quantities for preprocessing have been estimated.
columns A character string of variables that will be used as inputs. This field is a place-
holder and will be populated once recipes::prep() is used.
skip A logical. Should the step be skipped when the recipe is baked by bake.recipe()?
While all operations are baked when prep.recipe() is run, some operations may
not be able to be conducted on new data (e.g. processing the outcome vari-
able(s)). Care should be taken when using skip = TRUE as it may affect the
computations for subsequent operations.
id A character string that is unique to this step to identify it.
x A step_ts_pad object.
Details
Date Variable
• The eight intervals are: year, quarter, month, week, day, hour, min, and sec.
• Intervals like 30 minutes, 1 hour, or 14 days are possible.
• Numeric data: The step_ts_impute() preprocessing step can be used to impute numeric
time series data with or without seasonality
• Nominal data: The step_mode_impute() preprocessing step can be used to replace missing
values with the most common value.
Value
For step_ts_pad, an updated version of recipe with the new step added to the sequence of exist-
ing steps (if any). For the tidy method, a tibble with columns terms (the selectors or variables
selected), value (the feature names).
See Also
Padding & Imputation:
• Pad Time Series: step_ts_pad()
• Impute missing values with these: step_ts_impute(), step_ts_clean()
Time Series Analysis:
• Engineered Features: step_timeseries_signature(), step_holiday_signature(), step_fourier()
• Diffs & Lags: step_diff(), recipes::step_lag()
• Smoothing: step_slidify(), step_smooth()
• Variance Reduction: step_box_cox()
Main Recipe Functions:
• recipes::recipe()
• recipes::prep()
• recipes::bake()
Examples
library(recipes)
library(dplyr)
FB_tbl <- FANG %>% filter(symbol == "FB") %>% select(date, adjusted)  # assumed setup
rec_obj <- recipe(adjusted ~ ., data = FB_tbl) %>%
  step_ts_pad(date, by = "day", pad_value = NA)
# Tidy shows which columns have been affected during the 1st step
# in this case, step 1 is the step_ts_pad step
tidy(prep(rec_obj))
tidy(prep(rec_obj), number = 1)
summarise_by_time
Description
summarise_by_time() is a time-based variant of the popular dplyr::summarise() function that
uses .date_var to specify a date or date-time column and .by to group the calculation by groups
like "5 seconds", "week", or "3 months".
summarise_by_time() and summarize_by_time() are synonyms.
Usage
summarise_by_time(
.data,
.date_var,
.by = "day",
...,
.type = c("floor", "ceiling", "round"),
.week_start = NULL
)
summarize_by_time(
.data,
.date_var,
.by = "day",
...,
.type = c("floor", "ceiling", "round"),
.week_start = NULL
)
Arguments
.data A tbl object or data.frame
.date_var A column containing date or date-time values to summarize. If missing, attempts
to auto-detect the date column.
.by A time unit to summarise by. Time units are collapsed using lubridate::floor_date()
or lubridate::ceiling_date().
The value can be:
• second
• minute
• hour
• day
• week
• month
• bimonth
• quarter
• season
• halfyear
• year
Arbitrary unique English abbreviations as in the lubridate::period() con-
structor are allowed.
... Name-value pairs of summary functions. The name will be the name of the
variable in the result.
The value can be:
• A vector of length 1, e.g. min(x), n(), or sum(is.na(y)).
• A vector of length n, e.g. quantile().
• A data frame, to add multiple columns from a single expression.
.type One of "floor", "ceiling", or "round". Defaults to "floor". See lubridate::round_date().
.week_start When the unit is weeks, specify the reference day. 7 represents Sunday and 1
represents Monday.
Value
A tibble or data.frame
See Also
Time-Based dplyr functions:
Examples
# Libraries
library(dplyr)
# Last value in each month (day is first day of next month with ceiling option)
m4_daily %>%
group_by(id) %>%
summarise_by_time(
.by = "month",
value = last(value),
.type = "ceiling"
) %>%
# Shift to the last day of the month
mutate(date = date %-time% "1 day")
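Another short sketch, aggregating to weekly totals (the column names follow the m4_daily dataset used above):

```r
library(dplyr)
library(timetk)

# Total value per week, one row per id and week
m4_daily %>%
  group_by(id) %>%
  summarise_by_time(
    .date_var = date,
    .by       = "week",
    value     = sum(value)
  )
```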
taylor_30_min
Description
Half-hourly electricity demand in England and Wales from Monday 5 June 2000 to Sunday 27
August 2000. Discussed in Taylor (2003).
Usage
taylor_30_min
Format
A tibble: 4,032 x 2
Source
James W Taylor
References
Taylor, J.W. (2003) Short-term electricity demand forecasting using double seasonal exponential
smoothing. Journal of the Operational Research Society, 54, 799-805.
Examples
taylor_30_min
Description
The easiest way to add or subtract a period from a time series date or date-time vector.
Usage
add_time(index, period)
subtract_time(index, period)
Arguments
Details
A convenient wrapper for lubridate::period(). Adds and subtracts a period from a time-based
index. Great for:
• Finding a timestamp n-periods into the future or past
• Shifting a time-based index. Note that NA values may be present where dates don’t exist.
Period Specification
The period argument accepts complex strings like:
• "1 month 4 days 43 minutes"
• "second = 3, minute = 1, hour = 2, day = 13, week = 1"
Value
A date or datetime (POSIXct) vector the same length as index with the time values shifted +/- a
period.
See Also
Other Time-Based vector functions:
• between_time() - Range detection for date or date-time sequences.
Underlying function:
• lubridate::period()
Examples
# ADD TIME
# - Note `NA` values are created where daily dates aren't possible
#   (e.g. Feb 30 & 31 don't exist).
index_daily <- tk_make_timeseries("2016-01-01", "2016-12-31", by = "day")
index_daily %+time% "1 month"
# Subtracting Time
index_daily %-time% "1 month"
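The compound period strings described above can be combined in a single call; a minimal sketch (idx defined locally so the snippet stands alone):

```r
library(timetk)

idx <- tk_make_timeseries("2016-01-01", "2016-01-03", by = "day")

# Shift forward by a compound period
add_time(idx, "1 month 4 days")

# Shift backward using the infix form
idx %-time% "2 weeks"
```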
Description
Create rsample cross validation sets for time series. This function produces a sampling plan starting
with the most recent time series observations, rolling backwards. The sampling procedure is similar
to rsample::rolling_origin(), but places the focus of the cross validation on the most recent
time series data.
Usage
time_series_cv(
data,
date_var = NULL,
initial = 5,
assess = 1,
skip = 1,
lag = 0,
cumulative = FALSE,
slice_limit = n(),
point_forecast = FALSE,
...
)
Arguments
data A data frame.
date_var A date or date-time variable.
initial The number of samples used for analysis/modeling in the initial resample.
assess The number of samples used for each assessment resample.
skip An integer indicating how many (if any) additional resamples to skip to thin the
total amount of data points in the analysis resample. See the example below.
lag A value to include a lag between the assessment and analysis set. This is useful
if lagged predictors will be used during training and testing.
cumulative A logical. Should the analysis resample grow beyond the size specified by
initial at each resample?
slice_limit The number of slices to return. Set to dplyr::n(), which returns the maximum
number of slices.
point_forecast Whether or not to have the testing set be a single point forecast or to be a forecast
horizon. The default is to be a forecast horizon. Default: FALSE
... These dots are for future extensions and must be empty.
Details
Time-Based Specification
The initial, assess, skip, and lag variables can be specified as:
• Numeric: initial = 24
• Time-Based Phrases: initial = "2 years", if the data contains a date_var (date or date-
time)
Value
A tibble with classes time_series_cv, rset, tbl_df, tbl, and data.frame. The results include a
column for the data split objects and a column called id that has a character string with the resample
identifier.
See Also
• time_series_cv() and rsample::rolling_origin() - Functions used to create time series
resample specifications.
• plot_time_series_cv_plan() - The plotting function used for visualizing the time series
resample plan.
• time_series_split() - A convenience function to return a single time series split containing
a training/testing sample.
Examples
library(dplyr)
# DATA ----
m750 <- m4_monthly %>% dplyr::filter(id == "M750")
# (resample_spec and walmart_tscv are built here so the example is runnable;
#  argument values are illustrative)
resample_spec <- time_series_cv(m750, initial = "5 years",
                                assess = "2 years", slice_limit = 4)
# Select date and value columns from the tscv diagnostic tool
resample_spec %>% tk_time_series_cv_plan()
walmart_tscv <- walmart_sales_weekly %>%
    filter(id == "1_1") %>%
    time_series_cv(Date, initial = "12 months", assess = "3 months")
walmart_tscv %>%
    plot_time_series_cv_plan(Date, Weekly_Sales, .interactive = FALSE)
Description
time_series_split creates resample splits using time_series_cv() but returns only a single
split. This is useful when creating a single train/test split.
Usage
time_series_split(
data,
date_var = NULL,
initial = 5,
assess = 1,
skip = 1,
lag = 0,
cumulative = FALSE,
slice = 1,
point_forecast = FALSE,
...
)
Arguments
data A data frame.
date_var A date or date-time variable.
initial The number of samples used for analysis/modeling in the initial resample.
assess The number of samples used for each assessment resample.
skip An integer indicating how many (if any) additional resamples to skip to thin the
total amount of data points in the analysis resample. See the example below.
lag A value to include a lag between the assessment and analysis set. This is useful
if lagged predictors will be used during training and testing.
cumulative A logical. Should the analysis resample grow beyond the size specified by
initial at each resample?
slice Returns a single slice from time_series_cv
point_forecast Whether or not to have the testing set be a single point forecast or to be a forecast
horizon. The default is to be a forecast horizon. Default: FALSE
... These dots are for future extensions and must be empty.
Details
Time-Based Specification
The initial, assess, skip, and lag variables can be specified as:
• Numeric: initial = 24
• Time-Based Phrases: initial = "2 years", if the data contains a date_var (date or date-
time)
Value
An rsplit object that can be used with the training and testing functions to extract the data in
each split.
See Also
• time_series_cv() and rsample::rolling_origin() - Functions used to create time series
resample specifications.
Examples
library(dplyr)
# DATA ----
m750 <- m4_monthly %>% dplyr::filter(id == "M750")
# Get the most recent 3 years as testing, and previous 10 years as training
m750 %>%
time_series_split(initial = "10 years", assess = "3 years")
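The returned rsplit can then be handed to rsample's accessor functions; a sketch assuming the rsample package is available:

```r
library(dplyr)
library(rsample)
library(timetk)

m750 <- m4_monthly %>% filter(id == "M750")

split <- m750 %>%
  time_series_split(initial = "10 years", assess = "3 years")

training(split)  # the 10-year analysis set
testing(split)   # the most recent 3 years
```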
Description
The tk_acf_diagnostics() function provides a simple interface to detect Autocorrelation (ACF),
Partial Autocorrelation (PACF), and Cross Correlation (CCF) of Lagged Predictors in one tibble.
This function powers the plot_acf_diagnostics() visualization.
Usage
tk_acf_diagnostics(.data, .date_var, .value, .ccf_vars = NULL, .lags = 1000)
Arguments
.data A data frame or tibble with numeric features (values) in descending chronolog-
ical order
.date_var A column containing either date or date-time values
.value A numeric column with a value to have ACF and PACF calculations performed.
.ccf_vars Additional features to perform Lag Cross Correlations (CCFs) versus the .value.
Useful for evaluating external lagged regressors.
.lags A sequence of one or more lags to evaluate.
Details
Simplified ACF, PACF, & CCF
We are often interested in all 3 of these functions. Why not get all 3 at once? Now you can!
Lag Specification
Lags (.lags) can either be specified as:
• A time-based phrase indicating a duration (e.g. .lags = "3 months")
• A maximum lag (e.g. .lags = 28 evaluates lags 1 through 28)
• A sequence of lags (e.g. .lags = 7:28)
Value
A tibble or data.frame containing the autocorrelation, partial autocorrelation and cross correla-
tion data.
See Also
• Visualizing ACF, PACF, & CCF: plot_acf_diagnostics()
• Visualizing Seasonality: plot_seasonal_diagnostics()
• Visualizing Time Series: plot_time_series()
Examples
library(dplyr)
# Apply Transformations
FANG %>%
group_by(symbol) %>%
tk_acf_diagnostics(
date, diff_vec(adjusted), # Apply differencing transformation
.lags = 0:500
)
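Cross correlations against external lagged regressors follow the same pattern via .ccf_vars; a sketch (column names below come from the walmart_sales_weekly dataset that ships with timetk):

```r
library(dplyr)
library(timetk)

walmart_sales_weekly %>%
  filter(id == "1_1") %>%
  tk_acf_diagnostics(
    Date, Weekly_Sales,
    .ccf_vars = c(Temperature, Fuel_Price),  # external lagged regressors
    .lags     = 0:52                         # one year of weekly lags
  )
```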
tk_anomaly_diagnostics
Automatic group-wise Anomaly Detection by STL Decomposition
Description
tk_anomaly_diagnostics() is the preprocessor for plot_anomaly_diagnostics(). It performs
automatic anomaly detection for one or more time series groups.
Usage
tk_anomaly_diagnostics(
.data,
.date_var,
.value,
.frequency = "auto",
.trend = "auto",
.alpha = 0.05,
.max_anomalies = 0.2,
.message = TRUE
)
Arguments
.data A tibble or data.frame with a time-based column
.date_var A column containing either date or date-time values
.value A column containing numeric values
.frequency Controls the seasonal adjustment (removal of seasonality). Input can be either
"auto", a time-based definition (e.g. "2 weeks"), or a numeric number of obser-
vations per frequency (e.g. 10). Refer to tk_get_frequency().
.trend Controls the trend component. For STL, trend controls the sensitivity of the
LOESS smoother, which is used to remove the remainder. Refer to tk_get_trend().
.alpha Controls the width of the "normal" range. Lower values are more conservative
while higher values are less prone to incorrectly classifying "normal" observa-
tions.
.max_anomalies The maximum percent of anomalies permitted to be identified.
.message A boolean. If TRUE, will output information related to automatic frequency and
trend selection (if applicable).
Details
The tk_anomaly_diagnostics() method implements a 2-step process to detect outliers in time
series.
Step 1: Detrend & Remove Seasonality using STL Decomposition
The decomposition separates the "season" and "trend" components from the "observed" values leav-
ing the "remainder" for anomaly detection.
The user can control two parameters: frequency and trend.
1. .frequency: Adjusts the "season" component that is removed from the "observed" values.
2. .trend: Adjusts the trend window (the t.window parameter from stats::stl()) that is used.
The user may supply both .frequency and .trend as time-based durations (e.g. "6 weeks") or
numeric values (e.g. 180) or "auto", which predetermines the frequency and/or trend based on the
scale of the time series using the tk_time_scale_template().
Step 2: Anomaly Detection
Once the "trend" and "season" (seasonality) components are removed, anomaly detection is
performed on the "remainder". Anomalies are identified, and boundaries (recomposed_l1 and
recomposed_l2) are determined.
tk_anomaly_diagnostics 131
The Anomaly Detection Method uses an inner quartile range (IQR) spanning the 25th to 75th percentiles around the median.
IQR Adjustment, alpha parameter
With the default alpha = 0.05, the limits are established by expanding the 25/75 baseline by an
IQR Factor of 3 (3X). The IQR Factor = 0.15 / alpha (hence 3X with alpha = 0.05):
• To increase the IQR Factor controlling the limits, decrease the alpha, which makes it more
difficult to be an outlier.
• Increase alpha to make it easier to be an outlier.
• The IQR outlier detection method is used in forecast::tsoutliers().
• A similar outlier detection method is used by Twitter’s AnomalyDetection package.
• Both Twitter and Forecast tsoutliers methods have been implemented in Business Science’s
anomalize package.
Value
A tibble or data.frame with STL Decomposition Features (observed, season, trend, remainder,
seasadj) and Anomaly Features (remainder_l1, remainder_l2, anomaly, recomposed_l1, and recom-
posed_l2)
References
See Also
Examples
library(dplyr)
walmart_sales_weekly %>%
filter(id %in% c("1_1", "1_3")) %>%
group_by(id) %>%
tk_anomaly_diagnostics(Date, Weekly_Sales)
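Because the output is an ordinary tibble, the flagged rows can be pulled out directly; a sketch assuming the anomaly column holds "Yes"/"No" flags:

```r
library(dplyr)
library(timetk)

walmart_sales_weekly %>%
  filter(id %in% c("1_1", "1_3")) %>%
  group_by(id) %>%
  tk_anomaly_diagnostics(Date, Weekly_Sales) %>%
  filter(anomaly == "Yes")  # keep only the detected outliers
```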
tk_augment_differences
Add many differenced columns to the data
Description
A handy function for adding multiple lagged difference values to a data frame. Works with dplyr
groups too.
Usage
tk_augment_differences(
.data,
.value,
.lags = 1,
.differences = 1,
.log = FALSE,
.names = "auto"
)
Arguments
.data A tibble.
.value One or more column(s) to have a transformation applied. Usage of tidyselect
functions (e.g. contains()) can be used to select multiple columns.
.lags One or more lags for the difference(s)
.differences The number of differences to apply.
.log If TRUE, applies log-differences.
.names A vector of names for the new columns. Must be of same length as the number
of output columns. Use "auto" to automatically rename the columns.
Details
Benefits
This is a scalable function that is:
Value
Returns a tibble object describing the timeseries.
See Also
Augment Operations:
Underlying Function:
Examples
library(dplyr)
m4_monthly %>%
group_by(id) %>%
tk_augment_differences(value, .lags = 1:20)
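Log-differences, a common variance-stabilizing transformation, only require flipping .log; a minimal sketch:

```r
library(dplyr)
library(timetk)

m4_monthly %>%
  group_by(id) %>%
  tk_augment_differences(
    value,
    .lags        = 1,
    .differences = 1,
    .log         = TRUE   # first difference of log(value)
  )
```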
Description
A handy function for adding multiple fourier series to a data frame. Works with dplyr groups too.
Usage
tk_augment_fourier(.data, .date_var, .periods, .K = 1, .names = "auto")
Arguments
.data A tibble.
.date_var A date or date-time column used to calculate a fourier series
.periods One or more periods for the fourier series
.K The maximum number of fourier orders.
.names A vector of names for the new columns. Must be of same length as the number
of output columns. Use "auto" to automatically rename the columns.
Details
Benefits
This is a scalable function that is:
Value
Returns a tibble object describing the timeseries.
See Also
Augment Operations:
Underlying Function:
Examples
library(dplyr)
m4_monthly %>%
group_by(id) %>%
tk_augment_fourier(date, .periods = c(6, 12), .K = 2)
Description
Quickly add the "holiday signature" - sets of holiday features that correspond to calendar dates.
Works with dplyr groups too.
Usage
tk_augment_holiday_signature(
.data,
.date_var = NULL,
.holiday_pattern = ".",
.locale_set = c("all", "none", "World", "US", "CA", "GB", "FR", "IT", "JP", "CH", "DE"),
.exchange_set = c("all", "none", "NYSE", "LONDON", "NERC", "TSX", "ZURICH")
)
Arguments
.data A time-based tibble or time-series object.
.date_var A column containing either date or date-time values. If NULL, the time-based
column will be interpreted from the object (tibble).
.holiday_pattern
A regular expression pattern to search the "Holiday Set".
.locale_set Return binary holidays based on locale. One of: "all", "none", "World", "US",
"CA", "GB", "FR", "IT", "JP", "CH", "DE".
.exchange_set Return binary holidays based on Stock Exchange Calendars. One of: "all",
"none", "NYSE", "LONDON", "NERC", "TSX", "ZURICH".
Details
tk_augment_holiday_signature adds the holiday signature features. See tk_get_holiday_signature()
(powers the augment function) for a full description and examples for how to use.
1. Individual Holidays
These are single holiday features that can be filtered using a pattern. This helps in identifying
which holidays are important to a machine learning model. This can be useful for example in
e-commerce initiatives (e.g. sales during Christmas and Thanksgiving).
2. Locale-Based Summary Sets
Locale-based holiday sets are useful for e-commerce initiatives (e.g. sales during Christmas and
Thanksgiving). Filter on a locale to identify all holidays in that locale.
3. Stock Exchange Calendar Summary Sets
Exchange-based holiday sets are useful for identifying non-working days. Filter on an index to
identify all holidays that are commonly non-working.
Value
Returns a tibble object describing the holiday timeseries.
See Also
Augment Operations:
• tk_augment_timeseries_signature() - Group-wise augmentation of timestamp features
• tk_augment_holiday_signature() - Group-wise augmentation of holiday features
Examples
library(dplyr)
# All holidays in US
# (dates_in_2017_tbl built here so the example is runnable)
dates_in_2017_tbl <- tibble(
    index = tk_make_timeseries("2017-01-01", "2017-12-31", by = "day")
)
dates_in_2017_tbl %>%
    tk_augment_holiday_signature(
        index,
        .holiday_pattern = "US_",
        .locale_set      = "US",
        .exchange_set    = "none")
Description
A handy function for adding multiple lagged columns to a data frame. Works with dplyr groups
too.
Usage
tk_augment_lags(.data, .value, .lags = 1, .names = "auto")
Arguments
.data A tibble.
.value One or more column(s) to have a transformation applied. Usage of tidyselect
functions (e.g. contains()) can be used to select multiple columns.
.lags One or more lags for the difference(s)
.names A vector of names for the new columns. Must be of same length as .lags.
Details
Lags vs Leads
A negative lag is considered a lead. The tk_augment_leads() function is identical to tk_augment_lags()
with the exception that the automatic naming convention (.names = 'auto') will convert column
names with negative lags to leads.
Benefits
This is a scalable function that is:
Value
Returns a tibble object describing the timeseries.
See Also
Augment Operations:
Underlying Function:
Examples
library(dplyr)
# Lags
m4_monthly %>%
group_by(id) %>%
tk_augment_lags(contains("value"), .lags = 1:20)
# Leads
m4_monthly %>%
group_by(id) %>%
tk_augment_leads(value, .lags = 1:-20)
Description
Quickly use any function as a rolling function and apply to multiple .periods. Works with dplyr
groups too.
Usage
tk_augment_slidify(
.data,
.value,
.period,
.f,
...,
.align = c("center", "left", "right"),
.partial = FALSE,
.names = "auto"
)
Arguments
.data A tibble.
.value One or more column(s) to have a transformation applied. Usage of tidyselect
functions (e.g. contains()) can be used to select multiple columns.
.period One or more periods for the rolling window(s)
.f A summary [function / formula],
... Optional arguments for the summary function
.align Rolling functions generate .period - 1 fewer values than the incoming vector.
Thus, the vector needs to be aligned. Select one of "center", "left", or "right".
.partial Should the moving window be allowed to return partial (incomplete)
windows instead of NA values. Set to FALSE by default, but can be switched to
TRUE to remove NA's.
.names A vector of names for the new columns. Must be of same length as .period.
Default is "auto".
Details
tk_augment_slidify() scales the slidify_vec() function to multiple time series .periods. See
slidify_vec() for examples and usage of the core function arguments.
Value
Returns a tibble object describing the timeseries.
See Also
Augment Operations:
Underlying Function:
Examples
library(dplyr)
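The example here appears to have been truncated in extraction; a sketch of typical usage (rolling 3- and 6-period means of value from m4_monthly; argument values are illustrative) would look like:

```r
library(dplyr)
library(timetk)

m4_monthly %>%
  group_by(id) %>%
  tk_augment_slidify(
    value,
    .period  = c(3, 6),   # two rolling window widths
    .f       = mean,
    .partial = TRUE       # return partial windows instead of NA
  )
```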
Description
Usage
Arguments
Details
Value
See Also
Augment Operations:
Underlying Function:
Examples
library(dplyr)
m4_daily %>%
group_by(id) %>%
tk_augment_timeseries_signature(date)
tk_get_frequency Automatic frequency and trend calculation from a time series index
Description
Automatic frequency and trend calculation from a time series index
Usage
tk_get_frequency(idx, period = "auto", message = TRUE)
Arguments
idx A date or datetime index.
period Either "auto", a time-based definition (e.g. "2 weeks"), or a numeric number of
observations per frequency (e.g. 10).
message A boolean. If message = TRUE, the frequency or trend is output as a message
along with the units in the scale of the data.
Details
A frequency is loosely defined as the number of observations that comprise a cycle in a data set.
The trend is loosely defined as the time span that can be aggregated across to visualize the central
tendency of the data. It's often easiest to think of frequency and trend in terms of the time-based
units that the data is already in. This is what tk_get_frequency() and tk_get_trend() enable:
using time-based periods to define the frequency or trend.
Frequency:
As an example, a weekly cycle is often 5-days (for working days) or 7-days (for calendar days).
Rather than specify a frequency of 5 or 7, the user can specify period = "1 week", and tk_get_frequency()
will detect the scale of the time series and return 5 or 7 based on the actual data.
The period argument has three basic options for returning a frequency. Options include:
• "auto": A target frequency is determined using a pre-defined template (see template below).
• time-based duration: (e.g. "1 week" or "2 quarters" per cycle)
• numeric number of observations: (e.g. 5 for 5 observations per cycle)
Value
Returns a scalar numeric value indicating the number of observations in the frequency or trend span.
See Also
• Time Scale Template Modifiers: get_tk_time_scale_template(), set_tk_time_scale_template()
Examples
library(dplyr)
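The example here appears truncated; a sketch of the time-based specification described above, using a locally constructed daily index:

```r
library(dplyr)
library(timetk)

# A daily index: a "1 week" cycle should resolve to 7 observations
idx <- tk_make_timeseries("2020-01-01", "2020-12-31", by = "day")

tk_get_frequency(idx, period = "1 week")
tk_get_frequency(idx, period = "auto")   # uses the time scale template
```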
Description
Usage
tk_get_holiday_signature(
idx,
holiday_pattern = ".",
locale_set = c("all", "none", "World", "US", "CA", "GB", "FR", "IT", "JP", "CH", "DE"),
exchange_set = c("all", "none", "NYSE", "LONDON", "NERC", "TSX", "ZURICH")
)
tk_get_holidays_by_year(years = year(today()))
Arguments
Details
Feature engineering holidays can help identify critical patterns for machine learning algorithms.
tk_get_holiday_signature() helps by providing feature sets for 3 types of features:
1. Individual Holidays
These are single holiday features that can be filtered using a pattern. This helps in identifying
which holidays are important to a machine learning model. This can be useful for example in
e-commerce initiatives (e.g. sales during Christmas and Thanksgiving).
2. Locale-Based Summary Sets
Locale-based holiday sets are useful for e-commerce initiatives (e.g. sales during Christmas and
Thanksgiving). Filter on a locale to identify all holidays in that locale.
3. Stock Exchange Calendar Summary Sets
Exchange-based holiday sets are useful for identifying non-working days. Filter on an index to
identify all holidays that are commonly non-working.
Value
Returns a tibble object describing the timeseries holidays.
See Also
• tk_augment_holiday_signature() - A quick way to add holiday features to a data.frame
• step_holiday_signature() - Preprocessing and feature engineering steps for use with recipes
Examples
library(dplyr)
library(stringr)
# (idx defined here so the example is runnable)
idx <- tk_make_timeseries("2017-01-01", "2017-12-31", by = "day")
tk_get_holiday_signature(idx)
Description
Get date features from a time-series index
Usage
tk_get_timeseries_signature(idx)
tk_get_timeseries_summary(idx)
Arguments
idx A time-series index that is a vector of dates or datetimes.
Details
tk_get_timeseries_signature decomposes the timeseries into commonly needed features such
as numeric value, differences, year, month, day, day of week, day of month, day of year, hour,
minute, second.
tk_get_timeseries_summary returns the start, end, units, scale, and a "summary" of the timeseries
differences in seconds including the minimum, 1st quartile, median, mean,
3rd quartile, and maximum frequency. The timeseries differences give the user a better picture of
the index frequency so the user can understand the level of regularity or irregularity. A perfectly
regular time series will have equal values in seconds for each metric. However, this is not often the
case.
Important Note: These functions only work with time-based indexes in datetime, date, yearmon,
and yearqtr values. Regularized dates cannot be decomposed.
Value
Returns a tibble object describing the timeseries.
See Also
tk_index(), tk_augment_timeseries_signature(), tk_make_future_timeseries()
Examples
library(dplyr)
library(lubridate)
library(zoo)
# (indexes defined here so the examples are runnable)
FB_idx <- FANG %>% dplyr::filter(symbol == "FB") %>% tk_index()
idx_weekly <- seq.Date(ymd("2016-01-01"), by = "week", length.out = 6)
idx_yearmon <- seq(as.yearmon("2016-01-01"), by = 1/12, length.out = 12)
tk_get_timeseries_signature(FB_idx)
tk_get_timeseries_summary(FB_idx)
tk_get_timeseries_signature(idx_weekly)
tk_get_timeseries_summary(idx_weekly)
tk_get_timeseries_signature(idx_yearmon)
tk_get_timeseries_summary(idx_yearmon)
tk_get_timeseries_unit_frequency
Get the timeseries unit frequency for the primary time scales
Description
Get the timeseries unit frequency for the primary time scales
Usage
tk_get_timeseries_unit_frequency()
Value
tk_get_timeseries_unit_frequency returns a tibble containing the timeseries frequencies in
seconds for the primary time scales including "sec", "min", "hour", "day", "week", "month", "quar-
ter", and "year".
Examples
tk_get_timeseries_unit_frequency()
tk_get_timeseries_variables
Get date or datetime variables (column names)
Description
Get date or datetime variables (column names)
Usage
tk_get_timeseries_variables(data)
Arguments
data An object of class data.frame
Details
tk_get_timeseries_variables returns the column names of date or datetime variables in a data
frame. Classes that meet criteria for return include those that inherit POSIXt, Date, zoo::yearmon,
zoo::yearqtr. Function was adapted from padr:::get_date_variables(). See padr helpers.R
Value
tk_get_timeseries_variables returns a vector containing column names of date-like classes.
Examples
library(dplyr)
FANG %>%
tk_get_timeseries_variables()
tk_index Extract an index of date or datetime from time series objects, models,
forecasts
Description
Extract an index of date or datetime from time series objects, models, forecasts
Usage
tk_index(data, timetk_idx = FALSE, silent = FALSE)
has_timetk_idx(data)
Arguments
data A time-based tibble, time-series object, time-series model, or forecast object.
timetk_idx If timetk_idx is TRUE a timetk time-based index attribute is attempted to be
returned. If FALSE the default index is returned. See discussion below for further
details.
silent Used to toggle printing of messages and warnings.
Details
tk_index() is used to extract the date or datetime index from various time series objects, models
and forecasts. The method can be used on tbl, xts, zoo, zooreg, and ts objects. The method can
additionally be used on forecast objects and a number of objects generated by modeling functions
such as Arima, ets, and HoltWinters classes to get the index of the underlying data.
The boolean timetk_idx argument is applicable to regularized time series objects such as ts and
zooreg classes that have both a regularized index and potentially a "timetk index" (a time-based
attribute). When set to FALSE the regularized index is returned. When set to TRUE the time-based
timetk index is returned if present.
has_timetk_idx() is used to determine if the object has a "timetk index" attribute and can thus
benefit from the tk_index(timetk_idx = TRUE). TRUE indicates the "timetk index" attribute is
present. FALSE indicates the "timetk index" attribute is not present. If FALSE, the tk_index()
function will return the default index for the data type.
Important Note: To gain the benefit of timetk_idx the time series must have a timetk index. Use
has_timetk_idx to determine if the object has a timetk index. This is particularly important for
ts objects, which by default do not contain a time-based index and therefore must be coerced from
time-based objects such as tbl, xts, or zoo using the tk_ts() function in order to get the "timetk
index" attribute. Refer to tk_ts() for creating persistent date / datetime index during coercion to
ts.
Value
Returns a vector of date or date times
See Also
Examples
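The Examples section is empty in this extraction; a sketch of typical usage on a time-based tibble and a ts object:

```r
library(dplyr)
library(timetk)

# Extract the date index from a time-based tibble
FANG %>%
  filter(symbol == "FB") %>%
  tk_index()

# Check whether an object carries a "timetk index" attribute
has_timetk_idx(AirPassengers)  # a plain ts: no timetk index
```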
tk_make_future_timeseries
Make future time series from existing
Description
Usage
tk_make_future_timeseries(
idx,
length_out,
inspect_weekdays = FALSE,
inspect_months = FALSE,
skip_values = NULL,
insert_values = NULL,
n_future = NULL
)
Arguments
idx A vector of dates
length_out Number of future observations. Can be numeric number or a phrase like "1
year".
inspect_weekdays
Uses a logistic regression algorithm to inspect whether certain weekdays (e.g.
weekends) should be excluded from the future dates. Default is FALSE.
inspect_months Uses a logistic regression algorithm to inspect whether certain days of months
(e.g. last two weeks of year or seasonal days) should be excluded from the future
dates. Default is FALSE.
skip_values A vector of same class as idx of timeseries values to skip.
insert_values A vector of same class as idx of timeseries values to insert.
n_future (DEPRECATED) Number of future observations. Can be numeric number or a
phrase like "1 year".
Details
Future Sequences
tk_make_future_timeseries returns a time series based on the input index frequency and at-
tributes.
Specifying Length of Future Observations
The argument length_out determines how many future index observations to compute. It can be
specified as:
• A number: e.g. length_out = 12
• A time-based phrase: e.g. length_out = "12 months"
• The inspect_weekdays argument is useful in determining missing days of the week that oc-
cur on a weekly frequency such as every week, every other week, and so on. It’s recommended
to have at least 60 days to use this option.
• The inspect_months argument is useful in determining missing days of the month, quarter
or year; however, the algorithm can inadvertently select incorrect dates if the pattern is erratic.
Value
A vector containing future index of the same class as the incoming index idx
See Also
• Making Time Series: tk_make_timeseries()
• Working with Holidays & Weekends: tk_make_holiday_sequence(), tk_make_weekend_sequence(),
tk_make_weekday_sequence()
• Working with Timestamp Index: tk_index(), tk_get_timeseries_summary(), tk_get_timeseries_signature()
Examples
library(dplyr)
# Create index of days that FB stock will be traded in 2017 based on 2016 + holidays
FB_tbl <- FANG %>% dplyr::filter(symbol == "FB")
# Remove holidays with skip_values, and remove weekends with inspect_weekdays = TRUE
# (holidays defined here so the example is runnable)
holidays <- tk_make_holiday_sequence(
    start_date = "2016",
    end_date   = "2017",
    calendar   = "NYSE")
FB_tbl %>%
    tk_index() %>%
    tk_make_future_timeseries(length_out       = "1 year",
                              inspect_weekdays = TRUE,
                              skip_values      = holidays)
tk_make_holiday_sequence
Make daily Holiday and Weekend date sequences
Description
Usage
tk_make_holiday_sequence(
start_date,
end_date,
calendar = c("NYSE", "LONDON", "NERC", "TSX", "ZURICH"),
skip_values = NULL,
insert_values = NULL
)
tk_make_weekend_sequence(start_date, end_date)
tk_make_weekday_sequence(
start_date,
end_date,
remove_weekends = TRUE,
remove_holidays = FALSE,
calendar = c("NYSE", "LONDON", "NERC", "TSX", "ZURICH"),
skip_values = NULL,
insert_values = NULL
)
Arguments
start_date Used to define the starting date for date sequence generation. Provide in "YYYY-
MM-DD" format.
end_date Used to define the ending date for date sequence generation. Provide in "YYYY-
MM-DD" format.
calendar The calendar to be used in Date Sequence calculations for Holidays from the
timeDate package. Acceptable values are: "NYSE", "LONDON", "NERC", "TSX",
"ZURICH".
skip_values A daily date sequence to skip
insert_values A daily date sequence to insert
remove_weekends
A logical value indicating whether or not to remove weekends (Saturday and
Sunday) from the date sequence
remove_holidays
A logical value indicating whether or not to remove common holidays from the
date sequence
Details
Start and End Date Specification
• Accept shorthand notation (i.e. tk_make_timeseries() specifications apply)
• Only available in Daily Periods.
Holiday Sequences
tk_make_holiday_sequence() is a wrapper for various holiday calendars from the timeDate
package, making it easy to generate holiday sequences for common business calendars:
• New York Stock Exchange: calendar = "NYSE"
• London Stock Exchange: "LONDON"
• North American Reliability Council: "NERC"
• Toronto Stock Exchange: "TSX"
• Zurich Stock Exchange: "ZURICH"
Weekend and Weekday Sequences
tk_make_weekend_sequence() and tk_make_weekday_sequence() simply populate the dates in a
daily sequence that fall on weekends or weekdays, respectively.
Value
A vector containing future dates
See Also
• Intelligent date or date-time sequence creation: tk_make_timeseries()
• Holidays and weekends: tk_make_holiday_sequence(), tk_make_weekend_sequence(),
tk_make_weekday_sequence()
• Make future index from existing: tk_make_future_timeseries()
Examples
library(dplyr)
# Set max.print to 50
options_old <- options()$max.print
options(max.print = 50)
# Weekday Sequence
tk_make_weekday_sequence("2017", "2018", remove_holidays = TRUE)
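Holiday sequences for a specific exchange calendar use the same shorthand notation; a minimal sketch:

```r
library(timetk)

# NYSE trading holidays in 2017
tk_make_holiday_sequence(
  start_date = "2017-01-01",
  end_date   = "2017-12-31",
  calendar   = "NYSE"
)

# All weekend dates in January 2017
tk_make_weekend_sequence("2017-01-01", "2017-01-31")
```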
# ---- COMBINE HOLIDAYS WITH MAKE FUTURE TIMESERIES FROM EXISTING ----
# - A common machine learning application is creating a future time series data set
#   from an existing one (completed here so the example is runnable)
# Create index of days that FB stock will be traded in 2017 based on 2016 + holidays
FB_tbl   <- FANG %>% dplyr::filter(symbol == "FB")
holidays <- tk_make_holiday_sequence("2016", "2017", calendar = "NYSE")
FB_tbl %>%
    tk_index() %>%
    tk_make_future_timeseries(length_out = "1 year", inspect_weekdays = TRUE,
                              skip_values = holidays)
options(max.print = options_old)
Description
Improves on the seq.Date() and seq.POSIXt() functions by consolidating them into a single function,
tk_make_timeseries(). Intelligently handles character dates and logical assumptions based on user inputs.
Usage
tk_make_timeseries(
start_date,
end_date,
by,
length_out = NULL,
include_endpoints = TRUE,
skip_values = NULL,
insert_values = NULL
)
Arguments
start_date Used to define the starting date for date sequence generation. Provide in "YYYY-
MM-DD" format.
end_date Used to define the ending date for date sequence generation. Provide in "YYYY-
MM-DD" format.
by A character string, containing one of "sec", "min", "hour", "day", "week",
"month", "quarter" or "year". You can create regularly spaced sequences
using phrases like by = "10 min". See Details.
length_out Optional length of the sequence. Can be used instead of one of: start_date,
end_date, or by. Can be specified as a number or a time-based phrase.
include_endpoints
Logical. Whether or not to keep the last value when length_out is a time-based
phrase. Default is TRUE (keep last value).
skip_values A sequence to skip
insert_values A sequence to insert
Details
The tk_make_timeseries() function handles both date and date-time sequences automatically.
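As a sketch of that behavior, the same function can yield dates or date-times depending on the by argument (the specific calls below are illustrative):

```r
# Daily dates: a Date sequence
tk_make_timeseries("2017-01-01", "2017-01-05", by = "day")

# Sub-daily `by` switches to a POSIXct (date-time) sequence
tk_make_timeseries("2017-01-01", by = "30 min", length_out = "6 hours")
```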
Value
A vector containing date or date-times
See Also
• Intelligent date or date-time sequence creation: tk_make_timeseries()
• Holidays and weekends: tk_make_holiday_sequence(), tk_make_weekend_sequence(),
tk_make_weekday_sequence()
• Make future index from existing: tk_make_future_timeseries()
Examples
library(dplyr)
# Set max.print to 50
options_old <- options()$max.print
options(max.print = 50)
# Just Start
tk_make_timeseries("2017") # Daily sequence for the full year 2017
tk_make_timeseries(
"2011-01-01", length_out = 5,
skip_values = "2011-01-05",
insert_values = "2011-01-06"
)
options(max.print = options_old)
tk_seasonal_diagnostics
Group-wise Seasonality Data Preparation
Description
tk_seasonal_diagnostics() is the preprocessor for plot_seasonal_diagnostics(). It helps
by automating feature collection for time series seasonality analysis.
Usage
tk_seasonal_diagnostics(.data, .date_var, .value, .feature_set = "auto")
Arguments
.data A tibble or data.frame with a time-based column
.date_var A column containing either date or date-time values
.value A column containing numeric values
.feature_set One or multiple selections to analyze for seasonality. Choices include:
• "auto" - Automatically selects features based on the time stamps and length
of the series.
• "second" - Good for analyzing seasonality by second of each minute.
• "minute" - Good for analyzing seasonality by minute of the hour
• "hour" - Good for analyzing seasonality by hour of the day
• "wday.lbl" - Labeled weekdays. Good for analyzing seasonality by day of
the week.
• "week" - Good for analyzing seasonality by week of the year.
• "month.lbl" - Labeled months. Good for analyzing seasonality by month of
the year.
• "quarter" - Good for analyzing seasonality by quarter of the year
• "year" - Good for analyzing seasonality over multiple years.
Details
Automatic Feature Selection
Internal calculations are performed to detect a sub-range of features to include using the following logic:
• The minimum feature is selected based on the median difference between consecutive timestamps.
• The maximum feature is selected based on having 2 full periods.
Example: Hourly timestamp data that lasts more than 2 weeks will have the following features:
"hour", "wday.lbl", and "week".
Scalable with Grouped Data Frames
This function respects grouped data.frame and tibbles that were made with dplyr::group_by().
For grouped data, the automatic feature selection returned is a collection of all features within the
sub-groups. This means extra features are returned even though they may be meaningless for some
of the groups.
Transformations
The .value parameter respects transformations (e.g. .value = log(sales)).
Value
A tibble or data.frame with seasonal features
Examples
library(dplyr)
# Hourly Data
m4_hourly %>%
group_by(id) %>%
tk_seasonal_diagnostics(date, value)
# Monthly Data
m4_monthly %>%
group_by(id) %>%
tk_seasonal_diagnostics(date, value)
m4_weekly %>%
group_by(id) %>%
tk_seasonal_diagnostics(date, log(value))
m4_hourly %>%
group_by(id) %>%
tk_seasonal_diagnostics(date, value, .feature_set = c("hour", "week"))
Description
tk_stl_diagnostics() is the preprocessor for plot_stl_diagnostics(). It helps by automating
frequency and trend selection.
Usage
tk_stl_diagnostics(
.data,
.date_var,
.value,
.frequency = "auto",
.trend = "auto",
.message = TRUE
)
Arguments
.data A tibble or data.frame with a time-based column
.date_var A column containing either date or date-time values
.value A column containing numeric values
.frequency Controls the seasonal adjustment (removal of seasonality). Input can be either "auto", a time-based definition (e.g. "2 weeks"), or a numeric number of observations per frequency (e.g. 10). Refer to tk_get_frequency().
.trend Controls the trend component. For STL, trend controls the sensitivity of the
lowess smoother, which is used to remove the remainder.
.message A boolean. If TRUE, will output information related to automatic frequency and
trend selection (if applicable).
Details
The tk_stl_diagnostics() function generates a Seasonal-Trend-Loess decomposition. The function is "tidy" in the sense that it works on data frames and is designed to work with dplyr groups.
STL method:
The STL method implements time series decomposition using the underlying stats::stl(). The
decomposition separates the "season" and "trend" components from the "observed" values leaving
the "remainder".
1. The .frequency parameter adjusts the "season" component that is removed from the "observed" values.
2. The .trend parameter adjusts the trend window (t.window parameter from stl()) that is used.
The user may supply both .frequency and .trend as time-based durations (e.g. "6 weeks") or
numeric values (e.g. 180) or "auto", which automatically selects the frequency and/or trend based
on the scale of the time series.
Value
A tibble or data.frame containing the observed values along with the season, trend, and remainder components from the STL decomposition.
Examples
library(dplyr)
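The example body is missing from this excerpt. A plausible group-wise call, with the explicit .frequency and .trend overrides chosen purely for illustration:

```r
# Group-wise STL decomposition with automatic frequency/trend selection
m4_hourly %>%
    group_by(id) %>%
    tk_stl_diagnostics(date, value)

# Override the automatic selection with time-based durations
m4_hourly %>%
    group_by(id) %>%
    tk_stl_diagnostics(date, value, .frequency = "24 hours", .trend = "1 week")
```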
tk_summary_diagnostics
Group-wise Time Series Summary
Description
tk_summary_diagnostics() returns the time series summary from one or more time series groups in a tibble.
Usage
tk_summary_diagnostics(.data, .date_var)
Arguments
Details
Value
Examples
library(dplyr)
# Monthly Data
m4_monthly %>%
filter(id == "M750") %>%
tk_summary_diagnostics()
# Monthly Data
m4_monthly %>%
group_by(id) %>%
tk_summary_diagnostics()
Description
Coerce time series objects (e.g. xts, zoo, ts, timeSeries) to tibble objects.
Usage
tk_tbl(
data,
preserve_index = TRUE,
rename_index = "index",
timetk_idx = FALSE,
silent = FALSE,
...
)
Arguments
data A time-series object.
preserve_index Attempts to preserve a time series index. Default is TRUE.
rename_index Enables the index column to be renamed.
timetk_idx Used to return a date / datetime index for regularized objects that contain a
timetk "index" attribute. Refer to tk_index() for more information on returning
index information from regularized timeseries objects (i.e. ts).
silent Used to toggle printing of messages and warnings.
... Additional parameters passed to the tibble::as_tibble() function.
Details
tk_tbl is designed to coerce time series objects (e.g. xts, zoo, ts, timeSeries, etc) to tibble
objects. The main advantage is that the function keeps the date / date-time information from the
underlying time-series object.
When preserve_index = TRUE is specified, a new column, index, is created during object coercion, and the function attempts to preserve the date or date-time information. The date / date-time column name can be changed using the rename_index argument.
The timetk_idx argument is applicable when coercing ts objects that were created using tk_ts()
from an object that had a time base (e.g. tbl, xts, zoo). Setting timetk_idx = TRUE enables
returning the timetk "index" attribute if present, which is the original (non-regularized) time-based
index.
Value
Returns a tibble object.
See Also
tk_xts(), tk_zoo(), tk_zooreg(), tk_ts()
Examples
library(dplyr)
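The objects data_ts, data_xts, data_zooreg, and data_zoo used below are created elsewhere in the manual. A minimal reconstruction (names and values illustrative, and the index classes may differ from the originals):

```r
# A time-based tibble to coerce from
data_tbl <- tibble::tibble(
    date  = seq.Date(as.Date("2016-01-01"), by = "month", length.out = 12),
    value = 1:12)

# Regularized ts (retains a timetk "index" attribute)
data_ts <- tk_ts(data_tbl, start = 2016, frequency = 12)

# xts / zoo / zooreg counterparts
data_xts    <- tk_xts(data_tbl)
data_zoo    <- tk_zoo(data_tbl)
data_zooreg <- tk_zooreg(data_tbl, start = 2016, frequency = 12)
```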
# No index
as.data.frame(data_ts)
# Original date index returned (Only possible if original data has time-based index)
tk_tbl(data_ts, timetk_idx = TRUE)
# Dates are appropriate date class and within the data frame
tk_tbl(data_xts)
# Dates are appropriate zoo yearqtr class within the data frame
tk_tbl(data_zooreg)
# Dates are appropriate zoo yearmon class within the data frame
tk_tbl(data_zoo)
tk_time_series_cv_plan
Time Series Resample Plan Data Preparation
Description
The tk_time_series_cv_plan() function provides a simple interface to prepare a time series
resample specification (rset) of either rolling_origin or time_series_cv class.
Usage
tk_time_series_cv_plan(.data)
Arguments
.data A time series resample specification of either rolling_origin or time_series_cv class.
Details
Resample Set
A resample set is an output of the timetk::time_series_cv() function or the rsample::rolling_origin()
function.
Value
A tibble containing the time series cross-validation plan.
See Also
• time_series_cv() and rsample::rolling_origin() - Functions used to create time series resample specifications.
• plot_time_series_cv_plan() - The plotting function used for visualizing the time series
resample plan.
Examples
library(dplyr)
library(rsample)
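The example body is elided here. A typical workflow pairs time_series_cv() with tk_time_series_cv_plan(); the resampling arguments below are illustrative choices, not the manual's original values:

```r
FB_tbl <- FANG %>% filter(symbol == "FB")

# Build a rolling time series cross-validation specification
resample_spec <- time_series_cv(
    FB_tbl,
    initial     = "1 year",
    assess      = "6 weeks",
    skip        = "3 months",
    slice_limit = 6)

# Expand the resample specification into one tibble for inspection/plotting
resample_spec %>% tk_time_series_cv_plan()
```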
tk_ts Coerce time series objects and tibbles with date/date-time columns to
ts.
Description
Coerce time series objects and tibbles with date/date-time columns to ts.
Usage
tk_ts(
data,
select = NULL,
start = 1,
end = numeric(),
frequency = 1,
deltat = 1,
ts.eps = getOption("ts.eps"),
silent = FALSE
)
tk_ts_(
data,
select = NULL,
start = 1,
end = numeric(),
frequency = 1,
deltat = 1,
ts.eps = getOption("ts.eps"),
silent = FALSE
)
Arguments
data A time-based tibble or time-series object.
select Applicable to tibbles and data frames only. The column or set of columns to
be coerced to ts class.
start the time of the first observation. Either a single number or a vector of two
numbers (the second of which is an integer), which specify a natural time unit
and a (1-based) number of samples into the time unit. See the examples for the
use of the second form.
end the time of the last observation, specified in the same way as start.
Details
tk_ts() is a wrapper for stats::ts() that is designed to coerce tibble objects that have a "time base" (meaning the values vary with time) to ts class objects. There are two main advantages:
1. Non-numeric columns get removed instead of being populated by NA’s.
2. The returned ts object retains a "timetk index" (and various other attributes) if detected. The
"timetk index" can be used to coerce between tbl, xts, zoo, and ts data types.
The select argument is used to select subsets of columns from the incoming data.frame. Only
columns containing numeric data are coerced. At a minimum, a frequency and a start should be
specified.
For non-data.frame object classes (e.g. xts, zoo, timeSeries, etc) the objects are coerced using
stats::ts().
tk_ts_ is a nonstandard evaluation method.
Value
Returns a ts object.
See Also
tk_index(), tk_tbl(), tk_xts(), tk_zoo(), tk_zooreg()
Examples
library(dplyr)
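data_tbl is defined elsewhere in the manual; a minimal stand-in (values illustrative) keeps the examples below runnable:

```r
# A time-based tibble with one character column (to show column dropping)
data_tbl <- tibble::tibble(
    date = seq.Date(as.Date("2016-01-01"), by = 1, length.out = 5),
    x    = rep("chr values", 5),
    y    = cumsum(1:5))
```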
# as.ts: Character columns introduce NA's; Result does not retain index
stats::ts(data_tbl[,-1], start = 2016)
# tk_ts: Only numeric columns get coerced; Result retains index in numeric format
data_ts <- tk_ts(data_tbl, start = 2016)
data_ts
# timetk index
Description
tk_tsfeatures() is a tidyverse compliant wrapper for tsfeatures::tsfeatures(). The function computes a matrix of time series features that describes the various time series. It's designed for groupwise analysis using dplyr groups.
Usage
tk_tsfeatures(
.data,
.date_var,
.value,
.period = "auto",
.features = c("frequency", "stl_features", "entropy", "acf_features"),
.scale = TRUE,
.trim = FALSE,
.trim_amount = 0.1,
.parallel = FALSE,
.na_action = na.pass,
.prefix = "ts_",
.silent = TRUE,
...
)
Arguments
.data A tibble or data.frame with a time-based column
.date_var A column containing either date or date-time values
.value A column containing numeric values
.period The periodicity (frequency) of the time series data. Values can be provided as
follows:
• "auto" (default) Calculates using tk_get_frequency().
• "2 weeks": Would calculate the median number of observations in a 2-week
window.
• 7 (numeric): Would interpret the ts frequency as 7 observations per cycle
(common for weekly data)
.features Passed to features in the underlying tsfeatures() function. A vector of
function names that represent a feature aggregation function. Examples:
1. Use one of the function names from tsfeatures R package e.g.("lumpiness",
"stl_features").
2. Use a function name (e.g. "mean" or "median")
3. Create your own function and provide the function name
.scale If TRUE, time series are scaled to mean 0 and sd 1 before features are computed.
.trim If TRUE, time series are trimmed by trim_amount before features are computed.
Values larger than trim_amount in absolute value are set to NA.
.trim_amount Default level of trimming if trim==TRUE. Default: 0.1.
.parallel If TRUE, multiple cores (or multiple sessions) will be used. This only speeds
things up when there are a large number of time series.
When .parallel = TRUE, multiprocess = future::multisession is used. This can be adjusted by setting the multiprocess parameter. See the tsfeatures::tsfeatures() function for more details.
.na_action A function to handle missing values. Use na.interp to estimate missing values.
.prefix A prefix to prefix the feature columns. Default: "ts_".
.silent Whether or not to show messages and warnings.
... Other arguments get passed to the feature functions.
Details
The timetk::tk_tsfeatures() function implements the tsfeatures package for computing an aggregated feature matrix for time series, which is useful in many types of analysis such as clustering time series.
The timetk version ports the tsfeatures::tsfeatures() function to a tidyverse-compliant
format that uses a tidy data frame containing grouping columns (optional), a date column, and a
value column. Other columns are ignored.
It then becomes easy to summarize each time series by group-wise application of .features, which are simply functions that evaluate a time series and return a single aggregated value. (Example: "mean" would return the mean of the time series; note that values are scaled to mean 0 and sd 1 first.)
Function Internals:
Internally, the time series are converted to ts class using tk_ts(), where the period is the frequency of the time series. Values can be provided for .period, which will be used prior to conversion to ts class.
The function then leverages tsfeatures::tsfeatures() to compute the feature matrix of summarized feature values.
Value
A tibble or data.frame with aggregated features that describe each time series.
References
1. Rob Hyndman, Yanfei Kang, Pablo Montero-Manso, Thiyanga Talagala, Earo Wang, Yangzhuo-
ran Yang, Mitchell O’Hara-Wild: tsfeatures R package
Examples
library(dplyr)
walmart_sales_weekly %>%
group_by(id) %>%
tk_tsfeatures(
.date_var = Date,
.value = Weekly_Sales,
.period = 52,
.features = c("frequency", "stl_features", "entropy", "acf_features", "mean"),
.scale = TRUE,
.prefix = "ts_"
)
tk_xts Coerce time series objects and tibbles with date/date-time columns to
xts.
Description
Coerce time series objects and tibbles with date/date-time columns to xts.
Usage
tk_xts(data, select = NULL, date_var = NULL, silent = FALSE, ...)
Arguments
data A time-based tibble or time-series object.
select Applicable to tibbles and data frames only. The column or set of columns to
be coerced to xts class.
date_var Applicable to tibbles and data frames only. Column name to be used to
order.by. NULL by default. If NULL, function will find the date or date-time
column.
silent Used to toggle printing of messages and warnings.
... Additional parameters to be passed to xts::xts(). Refer to xts::xts().
Details
tk_xts is a wrapper for xts::xts() that is designed to coerce tibble objects that have a "time base" (meaning the values vary with time) to xts class objects. There are three main advantages:
1. Non-numeric columns that are not removed via select are dropped and the user is warned.
This prevents an error or coercion issue from occurring.
2. The date column is auto-detected if not specified by date_var. This takes the effort off the
user to assign a date vector during coercion.
3. ts objects are automatically coerced if a "timetk index" is present. Refer to tk_ts().
The select argument can be used to select subsets of columns from the incoming data.frame. Only
columns containing numeric data are coerced. The date_var can be used to specify the column
with the date index. If date_var = NULL, the date / date-time column is interpreted. Optionally, the
order.by argument from the underlying xts::xts() function can be used. The user must pass a
vector of dates or date-times if order.by is used.
For non-data.frame object classes (e.g. xts, zoo, timeSeries, etc) the objects are coerced using
xts::xts().
tk_xts_ is a nonstandard evaluation method.
Value
Returns a xts object.
See Also
tk_tbl(), tk_zoo(), tk_zooreg(), tk_ts()
Examples
library(dplyr)
# xts: Character columns cause coercion issues; order.by must be passed a vector of dates
xts::xts(data_tbl[,-1], order.by = data_tbl$date)
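For contrast with the xts::xts() call above, the tk_xts() counterpart (assuming the same data_tbl, defined elsewhere in the manual) is:

```r
# tk_xts: Character columns are dropped with a warning; date column auto-detected
tk_xts(data_tbl)
```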
tk_zoo Coerce time series objects and tibbles with date/date-time columns to
zoo.
Description
Coerce time series objects and tibbles with date/date-time columns to zoo.
Usage
tk_zoo(data, select = NULL, date_var = NULL, silent = FALSE, ...)
Arguments
data A time-based tibble or time-series object.
select Applicable to tibbles and data frames only. The column or set of columns to
be coerced to zoo class.
date_var Applicable to tibbles and data frames only. Column name to be used to
order.by. NULL by default. If NULL, function will find the date or date-time
column.
silent Used to toggle printing of messages and warnings.
... Additional parameters to be passed to xts::xts(). Refer to xts::xts().
Details
tk_zoo is a wrapper for zoo::zoo() that is designed to coerce tibble objects that have a "time base" (meaning the values vary with time) to zoo class objects. There are three main advantages:
1. Non-numeric columns that are not removed via select are dropped and the user is warned.
This prevents an error or coercion issue from occurring.
2. The date column is auto-detected if not specified by date_var. This takes the effort off the
user to assign a date vector during coercion.
3. ts objects are automatically coerced if a "timetk index" is present. Refer to tk_ts().
The select argument can be used to select subsets of columns from the incoming data.frame. Only
columns containing numeric data are coerced. The date_var can be used to specify the column
with the date index. If date_var = NULL, the date / date-time column is interpreted. Optionally, the
order.by argument from the underlying zoo::zoo() function can be used. The user must pass a
vector of dates or date-times if order.by is used. Important Note: The ... arguments are passed
to xts::xts(), which enables additional information (e.g. time zone) to be an attribute of the zoo
object.
For non-data.frame object classes (e.g. xts, zoo, timeSeries, etc) the objects are coerced using
zoo::zoo().
tk_zoo_ is a nonstandard evaluation method.
Value
Returns a zoo object.
See Also
tk_tbl(), tk_xts(), tk_zooreg(), tk_ts()
Examples
library(dplyr)
# zoo: Characters will cause error; order.by must be passed a vector of dates
zoo::zoo(data_tbl[,-c(1,2)], order.by = data_tbl$date)
# tk_zoo: Character columns dropped with a warning; No need to specify dates (auto detected)
tk_zoo(data_tbl)
tk_zooreg Coerce time series objects and tibbles with date/date-time columns to
zooreg.
Description
Coerce time series objects and tibbles with date/date-time columns to zooreg.
Usage
tk_zooreg(
data,
select = NULL,
date_var = NULL,
start = 1,
end = numeric(),
frequency = 1,
deltat = 1,
ts.eps = getOption("ts.eps"),
order.by = NULL,
silent = FALSE
)
tk_zooreg_(
data,
select = NULL,
date_var = NULL,
start = 1,
end = numeric(),
frequency = 1,
deltat = 1,
ts.eps = getOption("ts.eps"),
order.by = NULL,
silent = FALSE
)
Arguments
data A time-based tibble or time-series object.
select Applicable to tibbles and data frames only. The column or set of columns to
be coerced to zooreg class.
date_var Applicable to tibbles and data frames only. Column name to be used to
order.by. NULL by default. If NULL, function will find the date or date-time
column.
start the time of the first observation. Either a single number or a vector of two
integers, which specify a natural time unit and a (1-based) number of samples
into the time unit.
end the time of the last observation, specified in the same way as start.
frequency the number of observations per unit of time.
deltat the fraction of the sampling period between successive observations; e.g., 1/12
for monthly data. Only one of frequency or deltat should be provided.
ts.eps time series comparison tolerance. Frequencies are considered equal if their ab-
solute difference is less than ts.eps.
order.by a vector by which the observations in x are ordered. If this is specified the
arguments start and end are ignored and zoo(data, order.by, frequency)
is called. See zoo for more information.
silent Used to toggle printing of messages and warnings.
Details
tk_zooreg() is a wrapper for zoo::zooreg() that is designed to coerce tibble objects that have a "time base" (meaning the values vary with time) to zooreg class objects. There are two main advantages:
1. Non-numeric columns get removed instead of being populated by NA's.
2. The returned zooreg object retains a "timetk index" (if detected), which can be used to coerce between tbl, xts, zoo, and ts data types.
The select argument is used to select subsets of columns from the incoming data.frame. The date_var can be used to specify the column with the date index. If date_var = NULL, the date / date-time column is interpreted. Optionally, the order.by argument from the underlying zoo::zooreg() function can be used. The user must pass a vector of dates or date-times if order.by is used. Only columns containing numeric data are coerced. At a minimum, a frequency and a start should be specified.
For non-data.frame object classes (e.g. xts, zoo, timeSeries, etc) the objects are coerced using
zoo::zooreg().
tk_zooreg_ is a nonstandard evaluation method.
Value
Returns a zooreg object.
See Also
tk_tbl(), tk_xts(), tk_zoo(), tk_ts()
Examples
### tibble to zooreg: Comparison between tk_zooreg() and zoo::zooreg()
data_tbl <- tibble::tibble(
date = seq.Date(as.Date("2016-01-01"), by = 1, length.out = 5),
x = rep("chr values", 5),
y = cumsum(1:5),
z = cumsum(11:15) * rnorm(1))
# tk_zooreg: Only numeric columns get coerced; Result retains index as rownames
data_tk_zooreg <- tk_zooreg(data_tbl, start = 2016, freq = 365)
data_tk_zooreg # No inadvertent coercion to character class
# timetk index
tk_index(data_tk_zooreg, timetk_idx = FALSE) # Regularized index returned
tk_index(data_tk_zooreg, timetk_idx = TRUE) # Original date index returned
Description
This is mainly a wrapper for the outlier cleaning function, tsclean(), from the forecast R package. The ts_clean_vec() function includes arguments for applying seasonality to a numeric vector (non-ts) via the period argument.
Usage
ts_clean_vec(x, period = 1, lambda = NULL)
Arguments
x A numeric vector.
period A seasonal period to use during the transformation. If period = 1, seasonality
is not included and supsmu() is used to fit a trend. If period > 1, a robust
STL decomposition is first performed and a linear interpolation is applied to the
seasonally adjusted data.
lambda A box cox transformation parameter. If set to "auto", performs automated
lambda selection.
Details
Cleaning Outliers
To estimate missing values and outlier replacements, linear interpolation is used on the (possibly
seasonally adjusted) series. See forecast::tsoutliers() for the outlier detection method.
Box Cox Transformation
In many circumstances, a Box Cox transformation can help, especially if the series is multiplicative, meaning the variance grows exponentially. A Box Cox transformation can be automated by setting lambda = "auto" or can be specified by setting lambda to a numeric value.
Value
A numeric vector with the missing values and/or anomalies transformed to imputed values.
References
• Forecast R Package
• Forecasting Principles & Practices: Dealing with missing values and outliers
See Also
• Box Cox Transformation: box_cox_vec()
• Lag Transformation: lag_vec()
• Differencing Transformation: diff_vec()
• Rolling Window Transformation: slidify_vec()
• Loess Smoothing Transformation: smooth_vec()
• Fourier Series: fourier_vec()
• Missing Value Imputation for Time Series: ts_impute_vec()
• Outlier Cleaning for Time Series: ts_clean_vec()
Examples
library(dplyr)
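The example body is missing here. A small sketch with an injected outlier and a missing value (the test vector is illustrative):

```r
# Test vector: one artificial outlier (4 * 2) and one NA
values <- c(1, 2, 3, 4 * 2, 5, 6, 7, NA, 9, 10, 11, 12)

# No seasonality: supsmu() trend fit + linear interpolation
ts_clean_vec(values, period = 1)
```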
Description
This is mainly a wrapper for the Seasonally Adjusted Missing Value using Linear Interpolation function, na.interp(), from the forecast R package. The ts_impute_vec() function includes arguments for applying seasonality to a numeric vector (non-ts) via the period argument.
Usage
ts_impute_vec(x, period = 1, lambda = NULL)
Arguments
x A numeric vector.
period A seasonal period to use during the transformation. If period = 1, linear interpolation is performed. If period > 1, a robust STL decomposition is first performed and a linear interpolation is applied to the seasonally adjusted data.
lambda A box cox transformation parameter. If set to "auto", performs automated
lambda selection.
Details
Imputation using Linear Interpolation
Three circumstances cause strictly linear interpolation:
1. Period is 1: With period = 1, a seasonality cannot be interpreted and therefore linear is used.
2. Number of Non-Missing Values is less than 2-Periods: Insufficient values exist to detect
seasonality.
3. Number of Total Values is less than 3-Periods: Insufficient values exist to detect seasonality.
Seasonal Imputation using Linear Interpolation
For seasonal series with period > 1, a robust Seasonal Trend Loess (STL) decomposition is first
computed. Then a linear interpolation is applied to the seasonally adjusted data, and the seasonal
component is added back.
Box Cox Transformation
In many circumstances, a Box Cox transformation can help. Especially if the series is multiplicative
meaning the variance grows exponentially. A Box Cox transformation can be automated by setting
lambda = "auto" or can be specified by setting lambda = numeric value.
Value
A numeric vector with the missing values imputed.
References
• Forecast R Package
• Forecasting Principles & Practices: Dealing with missing values and outliers
See Also
Examples
library(dplyr)
# Linear interpolation (a small numeric vector with a missing value, for illustration)
values <- c(1, 2, 3, NA, 5)
ts_impute_vec(values, period = 1, lambda = NULL)
walmart_sales_weekly Sample Time Series Retail Data from the Walmart Recruiting Store
Sales Forecasting Competition
Description
The Kaggle "Walmart Recruiting - Store Sales Forecasting" Competition used retail data for combinations of stores and departments within each store. The competition began February 20th, 2014 and ended May 5th, 2014. The competition included data from 45 retail stores located in different regions. The dataset included various external features including Holiday information, Temperature, Fuel Price, and Markdown. This dataset includes a sample of 7 departments from Store ID 1 (7 total time series).
Usage
walmart_sales_weekly
Format
A tibble: 1,001 x 17
• id Factor. Unique series identifier (7 total)
• Store Numeric. Store ID.
• Dept Numeric. Department ID.
• Date Date. Weekly timestamp.
• Weekly_Sales Numeric. Sales for the given department in the given store.
• IsHoliday Logical. Whether the week is a "special" holiday for the store.
• Type Character. Type identifier of the store.
• Size Numeric. Store square-footage
• Temperature Numeric. Average temperature in the region.
• Fuel_Price Numeric. Cost of fuel in the region.
• MarkDown1, MarkDown2, MarkDown3, MarkDown4, MarkDown5 Numeric. Anonymized data
related to promotional markdowns that Walmart is running. MarkDown data is only available
after Nov 2011, and is not available for all stores all the time. Any missing value is marked
with an NA.
• CPI Numeric. The consumer price index.
• Unemployment Numeric. The unemployment rate in the region.
Details
This is a sample of 7 Weekly data sets from the Kaggle Walmart Recruiting Store Sales Forecasting
competition.
Holiday Features
The four holidays fall within the following weeks in the dataset (not all holidays are in the data):
• Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
• Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
• Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
• Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13
Source
• Kaggle Competition Website
Examples
walmart_sales_weekly
wikipedia_traffic_daily
Sample Daily Time Series Data from the Web Traffic Forecasting
(Wikipedia) Competition
Description
The Kaggle "Web Traffic Forecasting" (Wikipedia) Competition used Google Analytics Web Traffic Data for 145,000 websites. Each of these time series represents the number of daily views of a different Wikipedia article. The competition began July 13th, 2017 and ended November 15th, 2017. This dataset includes a sample of 10 article pages (10 total time series).
Usage
wikipedia_traffic_daily
Format
A tibble: 9,743 x 3
Details
This is a sample of 10 Daily data sets from the Kaggle Web Traffic Forecasting (Wikipedia) Competition.
Source
• Kaggle Competition Website
Examples
wikipedia_traffic_daily
105, 108, 111, 113, 116
133–135, 137, 139, 141, 144
subtract_time (time_arithmetic), 120
tk_augment_lags, 136
sum(), 118
tk_augment_lags(), 27, 133, 134, 136, 137,
summarise_by_time, 117
139, 141
summarise_by_time(), 8, 13, 18, 20, 35, 39,
tk_augment_leads (tk_augment_lags), 136
72, 118
summarize_by_time (summarise_by_time), tk_augment_slidify, 138
117 tk_augment_slidify(), 74, 78, 133, 134,
136, 137, 139, 141
taylor_30_min, 119 tk_augment_timeseries, 140
tibble::as_tibble(), 163 tk_augment_timeseries_signature
tidy.step_box_cox (step_box_cox), 83 (tk_augment_timeseries), 140
tidy.step_diff (step_diff), 86 tk_augment_timeseries_signature(),
tidy.step_fourier (step_fourier), 88 133–135, 137, 139, 141, 146
tidy.step_holiday_signature tk_get_frequency, 141
(step_holiday_signature), 91 tk_get_frequency(), 5, 50, 55, 70, 130, 160
tidy.step_log_interval tk_get_holiday, 143
(step_log_interval), 94 tk_get_holiday_signature
tidy.step_slidify (step_slidify), 96 (tk_get_holiday), 143
tidy.step_slidify_augment tk_get_holiday_signature(), 135, 136
(step_slidify_augment), 99 tk_get_holidays_by_year
INDEX 185