How to calculate the mean:
tourism %>%
features(Trips, list(mean = mean)) %>%
arrange(mean)
## # A tibble: 304 x 4
## Region State Purpose mean
## <chr> <chr> <chr> <dbl>
## 1 Kangaroo Island South Australia Other 0.340
## 2 MacDonnell Northern Territory Other 0.449
## 3 Wilderness West Tasmania Other 0.478
## 4 Barkly Northern Territory Other 0.632
## 5 Clare Valley South Australia Other 0.898
## 6 Barossa South Australia Other 1.02
## 7 Kakadu Arnhem Northern Territory Other 1.04
## 8 Lasseter Northern Territory Other 1.14
## 9 Wimmera Victoria Other 1.15
## 10 MacDonnell Northern Territory Visiting 1.18
## # ... with 294 more rows
How to calculate quantiles:
tourism %>% features(Trips, quantile)
## # A tibble: 304 x 8
## Region State Purpose `0%` `25%` `50%` `75%` `100%`
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Adelaide South Australia Busine~ 68.7 134. 153. 177. 242.
## 2 Adelaide South Australia Holiday 108. 135. 154. 172. 224.
## 3 Adelaide South Australia Other 25.9 43.9 53.8 62.5 107.
## 4 Adelaide South Australia Visiti~ 137. 179. 206. 229. 270.
## 5 Adelaide Hills South Australia Busine~ 0 0 1.26 3.92 28.6
## 6 Adelaide Hills South Australia Holiday 0 5.77 8.52 14.1 35.8
## 7 Adelaide Hills South Australia Other 0 0 0.908 2.09 8.95
## 8 Adelaide Hills South Australia Visiti~ 0.778 8.91 12.2 16.8 81.1
## 9 Alice Springs Northern Territo~ Busine~ 1.01 9.13 13.3 18.5 34.1
## 10 Alice Springs Northern Territo~ Holiday 2.81 16.9 31.5 44.8 76.5
## # ... with 294 more rows
For this the min is labeled 0% and max is labeled 100%. Weird.
feat_acf() computes some of the autocorrelations above
tourism %>% features(Trips, feat_acf)
## # A tibble: 304 x 10
## Region State Purpose acf1 acf10 diff1_acf1 diff1_acf10 diff2_acf1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Adelaide South Aus~ Busine~ 0.0333 0.131 -0.520 0.463 -0.676
## 2 Adelaide South Aus~ Holiday 0.0456 0.372 -0.343 0.614 -0.487
## 3 Adelaide South Aus~ Other 0.517 1.15 -0.409 0.383 -0.675
## 4 Adelaide South Aus~ Visiti~ 0.0684 0.294 -0.394 0.452 -0.518
## 5 Adelaide~ South Aus~ Busine~ 0.0709 0.134 -0.580 0.415 -0.750
## 6 Adelaide~ South Aus~ Holiday 0.131 0.313 -0.536 0.500 -0.716
## 7 Adelaide~ South Aus~ Other 0.261 0.330 -0.253 0.317 -0.457
## 8 Adelaide~ South Aus~ Visiti~ 0.139 0.117 -0.472 0.239 -0.626
## 9 Alice Sp~ Northern ~ Busine~ 0.217 0.367 -0.500 0.381 -0.658
## 10 Alice Sp~ Northern ~ Holiday -0.00660 2.11 -0.153 2.11 -0.274
## # ... with 294 more rows, and 2 more variables: diff2_acf10 <dbl>,
## # season_acf1 <dbl>
\[F_T = max\left(0,1 - \frac{Var(R_t)}{Var(T_t + R_t)} \right)\]
\[F_S = max\left(0,1 - \frac{Var(S_t)}{Var(T_t + R_t)} \right)\]
tourism %>%
features(Trips, feat_stl)
## # A tibble: 304 x 12
## Region State Purpose trend_strength seasonal_strengt~ seasonal_peak_y~
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Adelaide South Au~ Busine~ 0.451 0.380 3
## 2 Adelaide South Au~ Holiday 0.541 0.601 1
## 3 Adelaide South Au~ Other 0.743 0.189 2
## 4 Adelaide South Au~ Visiti~ 0.433 0.446 1
## 5 Adelaide~ South Au~ Busine~ 0.453 0.140 3
## 6 Adelaide~ South Au~ Holiday 0.512 0.244 2
## 7 Adelaide~ South Au~ Other 0.584 0.374 2
## 8 Adelaide~ South Au~ Visiti~ 0.481 0.228 0
## 9 Alice Sp~ Northern~ Busine~ 0.526 0.224 0
## 10 Alice Sp~ Northern~ Holiday 0.377 0.827 3
## # ... with 294 more rows, and 6 more variables: seasonal_trough_year <dbl>,
## # spikiness <dbl>, linearity <dbl>, curvature <dbl>, stl_e_acf1 <dbl>,
## # stl_e_acf10 <dbl>
To identify which series are heavily trended and which are heavily seasonal:
tourism %>%
features(Trips, feat_stl) %>%
ggplot(aes(x = trend_strength,
y = seasonal_strength_year,
col = Purpose)) +
geom_point() +
facet_wrap(vars(State))
Here we can see holiday travel is, of course, most seasonal while strongest trends are in Victoria and W Australia
Just a bullet point list of features of feasts
All features of the feasts package can be computed like this:
tourism_features <- tourism %>%
features(Trips, feature_set(pkgs = 'feasts'))
## Warning: `n_flat_spots()` was deprecated in feasts 0.1.5.
## Please use `longest_flat_spot()` instead.
tourism_features
## # A tibble: 304 x 51
## Region State Purpose trend_strength seasonal_strengt~ seasonal_peak_y~
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Adelaide South Au~ Busine~ 0.451 0.380 3
## 2 Adelaide South Au~ Holiday 0.541 0.601 1
## 3 Adelaide South Au~ Other 0.743 0.189 2
## 4 Adelaide South Au~ Visiti~ 0.433 0.446 1
## 5 Adelaide~ South Au~ Busine~ 0.453 0.140 3
## 6 Adelaide~ South Au~ Holiday 0.512 0.244 2
## 7 Adelaide~ South Au~ Other 0.584 0.374 2
## 8 Adelaide~ South Au~ Visiti~ 0.481 0.228 0
## 9 Alice Sp~ Northern~ Busine~ 0.526 0.224 0
## 10 Alice Sp~ Northern~ Holiday 0.377 0.827 3
## # ... with 294 more rows, and 45 more variables: seasonal_trough_year <dbl>,
## # spikiness <dbl>, linearity <dbl>, curvature <dbl>, stl_e_acf1 <dbl>,
## # stl_e_acf10 <dbl>, acf1 <dbl>, acf10 <dbl>, diff1_acf1 <dbl>,
## # diff1_acf10 <dbl>, diff2_acf1 <dbl>, diff2_acf10 <dbl>, season_acf1 <dbl>,
## # pacf5 <dbl>, diff1_pacf5 <dbl>, diff2_pacf5 <dbl>, season_pacf <dbl>,
## # zero_run_mean <dbl>, nonzero_squared_cv <dbl>, zero_start_prop <dbl>,
## # zero_end_prop <dbl>, lambda_guerrero <dbl>, kpss_stat <dbl>,
## # kpss_pvalue <dbl>, pp_stat <dbl>, pp_pvalue <dbl>, ndiffs <int>,
## # nsdiffs <int>, bp_stat <dbl>, bp_pvalue <dbl>, lb_stat <dbl>,
## # lb_pvalue <dbl>, var_tiled_var <dbl>, var_tiled_mean <dbl>,
## # shift_level_max <dbl>, shift_level_index <dbl>, shift_var_max <dbl>,
## # shift_var_index <dbl>, shift_kl_max <dbl>, shift_kl_index <dbl>,
## # spectral_entropy <dbl>, n_crossing_points <int>, longest_flat_spot <int>,
## # coef_hurst <dbl>, stat_arch_lm <dbl>
tourism_features %>%
select_at(vars(contains('season'), Purpose)) %>%
mutate(
seasonal_peak_year = seasonal_peak_year +
4*(seasonal_peak_year==0),
seasonal_trough_year = seasonal_trough_year +
4*(seasonal_trough_year==0),
seasonal_peak_year = glue('Q{seasonal_peak_year}'),
seasonal_trough_year = glue('Q{seasonal_trough_year}')
) %>%
GGally::ggpairs(mapping = aes(colour = Purpose), progress=FALSE)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Notable observations:
seasonal_peak_year and seasonal_trough_year columns show that seasonal peaks in Business travel occur most often in Q3 and least often in Q1PCA This is where they barely introduce the subject of PCA. Better to learn the details from somewhere else.
library(broom)
pca <- tourism_features %>%
select(-State, -Region, -Purpose) %>%
prcomp(scale = TRUE) %>%
augment(tourism_features)
pca %>%
ggplot(aes(x = .fittedPC1, y=.fittedPC2, col=Purpose)) +
geom_point() +
theme(aspect.ratio = 1)
The above graph shows
Inspecting the anomalies:
outliers <- pca %>%
filter(.fittedPC1 > 10) %>%
select(Region, State, Purpose, .fittedPC1, .fittedPC2)
outliers
## # A tibble: 4 x 5
## Region State Purpose .fittedPC1 .fittedPC2
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Australia's North West Western Australia Business 13.4 -11.4
## 2 Australia's South West Western Australia Holiday 10.9 0.857
## 3 Melbourne Victoria Holiday 12.3 -10.5
## 4 South Coast New South Wales Holiday 11.9 9.40
outliers %>%
left_join(tourism, by = c('State', 'Region', 'Purpose')) %>%
mutate(
Series = glue("{State}", "{Region}", "{Purpose}", .sep = "\n\n")
) %>%
ggplot(aes(x = Quarter, y = Trips)) +
geom_line() +
facet_grid(Series ~ ., scales = "free") +
labs(title = "Time Series of PCA Outliers")
Spculation as to why these are outliers: