Automated feature engineering is a cornerstone of the package. Below are some of the techniques we use in multivariate machine learning models, and the outside packages that make them possible.
Missing data is filled in using the pad_by_time function from the timetk package. First, each time series is grouped and padded between its existing start and end dates, with missing values padded using NA. Then the same process is run again, this time padding data from the hist_start_date supplied to forecast_time_series(), with missing values filled in with zero. This ensures that missing periods before a time series starts are all zeroes, while missing periods within the existing time series data are flagged to be imputed with new values in the next step.
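The two padding passes can be sketched on toy data as follows. This is a minimal illustration rather than the package's internal code: the toy table, the hist_start_date value, and the object names are assumptions made for the example.

```r
library(dplyr)
library(timetk)

# Toy monthly data (hypothetical): series "A" is missing 2020-03-01,
# and series "B" only starts in 2020-03-01
toy_tbl <- tibble::tibble(
  id    = c("A", "A", "A", "B", "B"),
  Date  = as.Date(c("2020-01-01", "2020-02-01", "2020-04-01",
                    "2020-03-01", "2020-04-01")),
  value = c(10, 12, 15, 7, 9)
)

# Pass 1: pad within each series' own start and end dates; new rows get NA
padded_inner <- toy_tbl %>%
  group_by(id) %>%
  pad_by_time(Date, .by = "month", .pad_value = NA) %>%
  ungroup()

# Pass 2: extend every series back to a common start date;
# rows created in this pass get zero
hist_start_date <- as.Date("2020-01-01") # assumed value for illustration
padded_full <- padded_inner %>%
  group_by(id) %>%
  pad_by_time(Date, .by = "month", .pad_value = 0,
              .start_date = hist_start_date) %>%
  ungroup()
```

After the first pass series "A" gains a row for March with an NA value, and after the second pass series "B" gains rows for January and February with a value of zero.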
After missing data is padded, the ts_impute_vec function from the timetk package is called to impute any NA values. This only happens if the clean_missing_values input of forecast_time_series() is set to TRUE; otherwise NA values are replaced with zero.
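A small sketch of that branch is shown below. The vector and the period = 1 setting are assumptions for illustration; the period finnts passes internally is not stated here.

```r
library(timetk)

# Hypothetical series with two missing values left over from padding
x <- c(10, 12, NA, 15, 14, NA, 18)

clean_missing_values <- TRUE # mirrors the forecast_time_series() input

x_filled <- if (clean_missing_values) {
  # period = 1 requests non-seasonal (linear) interpolation
  ts_impute_vec(x, period = 1)
} else {
  # otherwise NA values are simply replaced with zero
  ifelse(is.na(x), 0, x)
}
```

Either way the result has the same length as the input and contains no NA values.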
Outliers are handled using the ts_clean_vec function from the timetk package. Outliers are replaced after the missing data process, and only if the clean_outliers input of forecast_time_series() is set to TRUE.
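As a rough illustration, the sketch below runs ts_clean_vec on a made-up seasonal series with one injected spike. The series and the period = 12 setting are assumptions, and whether a given point is treated as an outlier depends on the function's internal thresholds, so inspect the result rather than assuming the spike is replaced.

```r
library(timetk)

# Hypothetical monthly series: a smooth seasonal pattern with one spike
x <- 100 + 10 * sin(2 * pi * (1:48) / 12)
x[20] <- 500

clean_outliers <- TRUE # mirrors the forecast_time_series() input

# ts_clean_vec() replaces detected outliers (and any NA values)
x_clean <- if (clean_outliers) ts_clean_vec(x, period = 12) else x

# Compare the original and cleaned values side by side
head(data.frame(original = x, cleaned = x_clean), 24)
```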
Important Note: Missing values and outliers are replaced for the target variable and any numeric external regressors.
The box_cox_vec function from the timetk package stabilizes the variance in each time series. It is applied to both the target variable and any external regressors before other transformations like differencing. You can control this within prep_models().
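The transformation and its inverse can be sketched as below. The series is made up, and estimating lambda once with auto_lambda() so it can be reused for the inverse is a choice made for this example, not necessarily how finnts stores it.

```r
library(timetk)

# Hypothetical strictly positive series whose variance grows with its level
x <- 10 * 1.1^(0:23) * rep(c(0.9, 1.1), 12)

# Estimate lambda once so the same value can be reused to invert later
lambda <- auto_lambda(x)

x_bc   <- box_cox_vec(x, lambda = lambda, silent = TRUE)
x_back <- box_cox_inv_vec(x_bc, lambda = lambda)

# The round trip recovers the original values (up to floating point error)
all.equal(x, x_back)
```

Keeping lambda alongside the transformed data matters because forecasts produced on the transformed scale must be inverted with the same value.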
The feasts package is used to check whether each time series is stationary, and the differencing required to make it stationary (up to two standard differences with lag one) is applied using the diff_vec function from the timetk package. This is applied to the target variable and any external regressors before other features are created. Data is undifferenced before training for univariate models like arima, but differenced data is used for all multivariate models. You can control the differencing within prep_models().
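One way to sketch this step is shown below. It assumes the stationarity check is done with feasts::unitroot_ndiffs() (which needs the urca package installed); the text above only says feasts is used, so treat that function choice and the simulated series as assumptions.

```r
library(timetk)

# Hypothetical trending series (a random walk with drift)
set.seed(123)
x <- cumsum(rnorm(60, mean = 1))

# Assumed check: number of lag-one differences suggested, capped at two
n_diffs <- min(as.integer(feasts::unitroot_ndiffs(x)), 2L)

# Apply the differencing; the first n_diffs values become NA
x_diff <- x
if (n_diffs > 0) {
  x_diff <- diff_vec(x, lag = 1, difference = n_diffs, silent = TRUE)
}

# To return to the original scale, supply the leading original values
x_back <- x
if (n_diffs > 0) {
  x_back <- diff_inv_vec(x_diff, lag = 1, difference = n_diffs,
                         initial_values = x[seq_len(n_diffs)])
}
```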
The tk_augment_timeseries_signature function from the timetk package extracts various date features from the time stamp. The function doesn’t differentiate between date types, so features need to be removed depending on the date type. For example, features related to week and day are automatically removed for a monthly forecast.
Fourier series are also added using the tk_augment_fourier function from timetk.
library(dplyr)
library(timetk)
m4_monthly %>%
  timetk::tk_augment_timeseries_signature(date) %>%
  dplyr::group_by(id) %>%
  timetk::tk_augment_fourier(date, .periods = c(3, 6, 12), .K = 1) %>%
  dplyr::ungroup()
#> # A tibble: 1,574 x 37
#> id date value index.num diff year year.iso half quarter month
#> <fct> <date> <dbl> <dbl> <dbl> <int> <int> <int> <int> <int>
#> 1 M1 1976-06-01 8000 202435200 NA 1976 1976 1 2 6
#> 2 M1 1976-07-01 8350 205027200 2592000 1976 1976 2 3 7
#> 3 M1 1976-08-01 8570 207705600 2678400 1976 1976 2 3 8
#> 4 M1 1976-09-01 7700 210384000 2678400 1976 1976 2 3 9
#> 5 M1 1976-10-01 7080 212976000 2592000 1976 1976 2 4 10
#> 6 M1 1976-11-01 6520 215654400 2678400 1976 1976 2 4 11
#> 7 M1 1976-12-01 6070 218246400 2592000 1976 1976 2 4 12
#> 8 M1 1977-01-01 6650 220924800 2678400 1977 1976 1 1 1
#> 9 M1 1977-02-01 6830 223603200 2678400 1977 1977 1 1 2
#> 10 M1 1977-03-01 5710 226022400 2419200 1977 1977 1 1 3
#> # i 1,564 more rows
#> # i 27 more variables: month.xts <int>, month.lbl <ord>, day <int>, hour <int>,
#> # minute <int>, second <int>, hour12 <int>, am.pm <int>, wday <int>,
#> # wday.xts <int>, wday.lbl <ord>, mday <int>, qday <int>, yday <int>,
#> # mweek <int>, week <int>, week.iso <int>, week2 <int>, week3 <int>,
#> # week4 <int>, mday7 <int>, date_sin3_K1 <dbl>, date_cos3_K1 <dbl>,
#> # date_sin6_K1 <dbl>, date_cos6_K1 <dbl>, date_sin12_K1 <dbl>, ...
Lags of the target variable and external regressors are created using the tk_augment_lags function from timetk.
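A minimal sketch of lag creation on the m4_monthly data used earlier is shown below. The lag values are chosen for illustration only.

```r
library(dplyr)
library(timetk)

# Create lags within each series so values never leak across ids
lagged_tbl <- m4_monthly %>%
  group_by(id) %>%
  tk_augment_lags(value, .lags = c(3, 6, 12)) %>%
  ungroup()

# New columns are named value_lag3, value_lag6 and value_lag12
names(lagged_tbl)
```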
Rolling window calculations of the target variable are created using the tk_augment_slidify function from timetk. The below calculations are created over various window values.
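A rolling average over two window sizes can be sketched as follows. The mean function and the windows of 3 and 6 are illustrative; the prepared data shown later in this document suggests the rolling features are built on lagged values, which this simplified sketch does not reproduce.

```r
library(dplyr)
library(timetk)

# Right-aligned rolling means so each window only looks backward in time
rolled_tbl <- m4_monthly %>%
  group_by(id) %>%
  tk_augment_slidify(
    value,
    .period  = c(3, 6),
    .f       = mean,
    .align   = "right",
    .partial = TRUE # allow shorter windows at the start of each series
  ) %>%
  ungroup()
```

Right alignment is used here because a centered window would pull in future values, which are not available at forecast time.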
Polynomial transformations are created for the target variable, and lags are then created on top of them. The below transformations are created.
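The idea of lagging a transformed target can be sketched as below. The squared and cubed terms are assumptions picked for illustration and are not a statement of which transformations finnts creates.

```r
library(dplyr)
library(timetk)

poly_tbl <- m4_monthly %>%
  group_by(id) %>%
  mutate(
    value_squared = value^2, # illustrative transformation (assumption)
    value_cubed   = value^3  # illustrative transformation (assumption)
  ) %>%
  # Lag the transformed columns so only past information is used
  tk_augment_lags(c(value_squared, value_cubed), .lags = c(3, 6)) %>%
  ungroup()
```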
In addition to the standard approaches above, finnts also prepares features in two different ways for multivariate machine learning models.
In the first recipe, referred to as “R1” in default finnts models, all of the engineered target and external regressor features are used, but their lags cannot be less than the forecast horizon. For example, for a monthly data set with a forecast horizon of 3, finnts will take engineered features like lags and rolling window features but only use those for periods equal to or greater than 3. Recursive forecasting is not supported in default finnts multivariate machine learning models, since feeding forecast outputs back in as features to create another forecast adds layers of uncertainty that can easily spiral out of control and produce poor forecasts. NA values created by generating lag features are filled “up”. This results in the first few periods of a time series having some data leakage, but the effect should be small if the time series is long enough.
library(finnts)
hist_data <- timetk::m4_monthly %>%
  dplyr::filter(
    date >= "2012-01-01",
    id == "M2"
  ) %>%
  dplyr::rename(Date = date) %>%
  dplyr::mutate(id = as.character(id))
run_info <- set_run_info(
  experiment_name = "finnts_fcst",
  run_name = "R1_run"
)
prep_data(
  run_info = run_info,
  input_data = hist_data,
  combo_variables = c("id"),
  target_variable = "value",
  date_type = "month",
  forecast_horizon = 3,
  recipes_to_run = "R1"
)
R1_prepped_data_tbl <- get_prepped_data(
  run_info = run_info,
  recipe = "R1"
)
print(R1_prepped_data_tbl)
#> # A tibble: 45 x 79
#> Date Combo id Target Date_index.num Date_diff Date_year Date_half
#> <date> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2012-01-01 M2 M2 680794. 1325376000 0 2012 1
#> 2 2012-02-01 M2 M2 680794. 1328054400 2678400 2012 1
#> 3 2012-03-01 M2 M2 869475. 1330560000 2505600 2012 1
#> 4 2012-04-01 M2 M2 -1244450. 1333238400 2678400 2012 1
#> 5 2012-05-01 M2 M2 1214019. 1335830400 2592000 2012 1
#> 6 2012-06-01 M2 M2 184088. 1338508800 2678400 2012 1
#> 7 2012-07-01 M2 M2 -482907. 1341100800 2592000 2012 2
#> 8 2012-08-01 M2 M2 -116729. 1343779200 2678400 2012 2
#> 9 2012-09-01 M2 M2 -340595. 1346457600 2678400 2012 2
#> 10 2012-10-01 M2 M2 311662. 1349049600 2592000 2012 2
#> # i 35 more rows
#> # i 71 more variables: Date_quarter <dbl>, Date_month <dbl>,
#> # Date_month.lbl <chr>, Target_lag3 <dbl>, Target_lag6 <dbl>,
#> # Target_lag9 <dbl>, Target_lag12 <dbl>, Target_lag3_roll3_Avg <dbl>,
#> # Target_lag6_roll3_Avg <dbl>, Target_lag9_roll3_Avg <dbl>,
#> # Target_lag12_roll3_Avg <dbl>, Target_lag3_roll6_Avg <dbl>,
#> # Target_lag6_roll6_Avg <dbl>, Target_lag9_roll6_Avg <dbl>, ...
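The horizon constraint and the “up” fill described for “R1” can be approximated outside of finnts as in the sketch below. The candidate lag set is hypothetical, and this is a simplified illustration rather than the code prep_data() runs.

```r
library(dplyr)
library(tidyr)
library(timetk)

forecast_horizon <- 3
candidate_lags <- c(1, 2, 3, 6, 9, 12) # hypothetical candidate set

# Keep only lags that are still known when forecasting 3 periods ahead
usable_lags <- candidate_lags[candidate_lags >= forecast_horizon]

r1_sketch_tbl <- m4_monthly %>%
  filter(id == "M2") %>%
  tk_augment_lags(value, .lags = usable_lags) %>%
  # Fill the leading NA values with the next observed lag value
  fill(starts_with("value_lag"), .direction = "up")
```

Filling “up” copies later values into earlier rows, which is the source of the small amount of leakage at the start of the series.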
The second recipe is referred to as “R2” in default finnts models. It takes a very different approach from the “R1” recipe. For a 3 month forecast horizon on a monthly data set, target and rolling window features are created depending on the horizon period. They are also constrained to be equal to or less than the forecast horizon. In the example below, “Origin” and “Horizon” features are created for each time period. This duplicates rows in the original data set to create new features that are specific to each horizon period, which helps the default finnts models find new relationships to model compared with the more formal approach in “R1”. NA values created by generating lag features are filled “up”.
library(finnts)
hist_data <- timetk::m4_monthly %>%
  dplyr::filter(
    date >= "2012-01-01",
    id == "M2"
  ) %>%
  dplyr::rename(Date = date) %>%
  dplyr::mutate(id = as.character(id))
run_info <- set_run_info(
  experiment_name = "finnts_fcst",
  run_name = "R2_run"
)
prep_data(
  run_info = run_info,
  input_data = hist_data,
  combo_variables = c("id"),
  target_variable = "value",
  date_type = "month",
  forecast_horizon = 3,
  recipes_to_run = "R2"
)
R2_prepped_data_tbl <- get_prepped_data(
  run_info = run_info,
  recipe = "R2"
)
print(R2_prepped_data_tbl)
#> # A tibble: 135 x 107
#> Date Combo id Target Date_index.num Date_diff Date_year Date_half
#> <date> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2012-01-01 M2 M2 680794. 1325376000 0 2012 1
#> 2 2012-02-01 M2 M2 680794. 1328054400 2678400 2012 1
#> 3 2012-03-01 M2 M2 869475. 1330560000 2505600 2012 1
#> 4 2012-04-01 M2 M2 -1244450. 1333238400 2678400 2012 1
#> 5 2012-05-01 M2 M2 1214019. 1335830400 2592000 2012 1
#> 6 2012-06-01 M2 M2 184088. 1338508800 2678400 2012 1
#> 7 2012-07-01 M2 M2 -482907. 1341100800 2592000 2012 2
#> 8 2012-08-01 M2 M2 -116729. 1343779200 2678400 2012 2
#> 9 2012-09-01 M2 M2 -340595. 1346457600 2678400 2012 2
#> 10 2012-10-01 M2 M2 311662. 1349049600 2592000 2012 2
#> # i 125 more rows
#> # i 99 more variables: Date_quarter <dbl>, Date_month <dbl>,
#> # Date_month.lbl <chr>, Horizon <dbl>, Origin <dbl>, Target_lag1 <dbl>,
#> # Target_lag2 <dbl>, Target_lag3 <dbl>, Target_lag6 <dbl>, Target_lag9 <dbl>,
#> # Target_lag12 <dbl>, Target_lag1_roll3_Avg <dbl>,
#> # Target_lag2_roll3_Avg <dbl>, Target_lag3_roll3_Avg <dbl>,
#> # Target_lag6_roll3_Avg <dbl>, Target_lag9_roll3_Avg <dbl>, ...
In addition to everything called out above, some models have their own specific transformations that need to be applied before training. For example, the “glmnet” model needs categorical variables transformed into continuous variables and the data centered and scaled before training. Each default model in finnts has its own preprocessing steps to ensure the data fed into the model has the best chance of producing a high quality forecast. The recipes package is used to apply the preprocessing transformations needed before training a model.
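As an illustration of the kind of preprocessing described for “glmnet”, the sketch below builds a small recipe that one-hot encodes a categorical column and then centers and scales the predictors. The toy table and the choice of steps are assumptions for this example and are not the exact recipe finnts uses.

```r
library(dplyr)
library(recipes)

# Hypothetical prepared training data
train_tbl <- tibble::tibble(
  Target         = c(10, 12, 15, 11, 14, 18),
  Date_month.lbl = factor(c("January", "February", "March",
                            "January", "February", "March")),
  Target_lag3    = c(9, 10, 12, 15, 11, 14)
)

glmnet_style_recipe <- recipe(Target ~ ., data = train_tbl) %>%
  step_zv(all_predictors()) %>%                           # drop constant columns
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>% # categorical to numeric
  step_normalize(all_numeric_predictors())                # center and scale

# Estimate the steps on the training data and return the processed table
baked_tbl <- glmnet_style_recipe %>%
  prep() %>%
  bake(new_data = NULL)
```

Because the steps are estimated with prep() on the training data, the same centering and scaling values can later be applied to new data with bake().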