4.1 Simple Stats

How to calculate the mean:

tourism %>%
  features(Trips, list(mean = mean)) %>%
  arrange(mean)
## # A tibble: 304 x 4
##    Region          State              Purpose   mean
##    <chr>           <chr>              <chr>    <dbl>
##  1 Kangaroo Island South Australia    Other    0.340
##  2 MacDonnell      Northern Territory Other    0.449
##  3 Wilderness West Tasmania           Other    0.478
##  4 Barkly          Northern Territory Other    0.632
##  5 Clare Valley    South Australia    Other    0.898
##  6 Barossa         South Australia    Other    1.02 
##  7 Kakadu Arnhem   Northern Territory Other    1.04 
##  8 Lasseter        Northern Territory Other    1.14 
##  9 Wimmera         Victoria           Other    1.15 
## 10 MacDonnell      Northern Territory Visiting 1.18 
## # ... with 294 more rows

How to calculate quantiles:

tourism %>% features(Trips, quantile)
## # A tibble: 304 x 8
##    Region         State             Purpose    `0%`  `25%`   `50%`  `75%` `100%`
##    <chr>          <chr>             <chr>     <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
##  1 Adelaide       South Australia   Busine~  68.7   134.   153.    177.   242.  
##  2 Adelaide       South Australia   Holiday 108.    135.   154.    172.   224.  
##  3 Adelaide       South Australia   Other    25.9    43.9   53.8    62.5  107.  
##  4 Adelaide       South Australia   Visiti~ 137.    179.   206.    229.   270.  
##  5 Adelaide Hills South Australia   Busine~   0       0      1.26    3.92  28.6 
##  6 Adelaide Hills South Australia   Holiday   0       5.77   8.52   14.1   35.8 
##  7 Adelaide Hills South Australia   Other     0       0      0.908   2.09   8.95
##  8 Adelaide Hills South Australia   Visiti~   0.778   8.91  12.2    16.8   81.1 
##  9 Alice Springs  Northern Territo~ Busine~   1.01    9.13  13.3    18.5   34.1 
## 10 Alice Springs  Northern Territo~ Holiday   2.81   16.9   31.5    44.8   76.5 
## # ... with 294 more rows

For this the min is labeled 0% and max is labeled 100%. Weird.


4.2 ACF Features

feat_acf() computes some of the autocorrelations above

tourism %>% features(Trips, feat_acf)
## # A tibble: 304 x 10
##    Region    State      Purpose     acf1 acf10 diff1_acf1 diff1_acf10 diff2_acf1
##    <chr>     <chr>      <chr>      <dbl> <dbl>      <dbl>       <dbl>      <dbl>
##  1 Adelaide  South Aus~ Busine~  0.0333  0.131     -0.520       0.463     -0.676
##  2 Adelaide  South Aus~ Holiday  0.0456  0.372     -0.343       0.614     -0.487
##  3 Adelaide  South Aus~ Other    0.517   1.15      -0.409       0.383     -0.675
##  4 Adelaide  South Aus~ Visiti~  0.0684  0.294     -0.394       0.452     -0.518
##  5 Adelaide~ South Aus~ Busine~  0.0709  0.134     -0.580       0.415     -0.750
##  6 Adelaide~ South Aus~ Holiday  0.131   0.313     -0.536       0.500     -0.716
##  7 Adelaide~ South Aus~ Other    0.261   0.330     -0.253       0.317     -0.457
##  8 Adelaide~ South Aus~ Visiti~  0.139   0.117     -0.472       0.239     -0.626
##  9 Alice Sp~ Northern ~ Busine~  0.217   0.367     -0.500       0.381     -0.658
## 10 Alice Sp~ Northern ~ Holiday -0.00660 2.11      -0.153       2.11      -0.274
## # ... with 294 more rows, and 2 more variables: diff2_acf10 <dbl>,
## #   season_acf1 <dbl>

4.3 STL Features

\[F_T = max\left(0,1 - \frac{Var(R_t)}{Var(T_t + R_t)} \right)\]

\[F_S = max\left(0,1 - \frac{Var(S_t)}{Var(T_t + R_t)} \right)\]

tourism %>%
  features(Trips, feat_stl)
## # A tibble: 304 x 12
##    Region    State     Purpose trend_strength seasonal_strengt~ seasonal_peak_y~
##    <chr>     <chr>     <chr>            <dbl>             <dbl>            <dbl>
##  1 Adelaide  South Au~ Busine~          0.451             0.380                3
##  2 Adelaide  South Au~ Holiday          0.541             0.601                1
##  3 Adelaide  South Au~ Other            0.743             0.189                2
##  4 Adelaide  South Au~ Visiti~          0.433             0.446                1
##  5 Adelaide~ South Au~ Busine~          0.453             0.140                3
##  6 Adelaide~ South Au~ Holiday          0.512             0.244                2
##  7 Adelaide~ South Au~ Other            0.584             0.374                2
##  8 Adelaide~ South Au~ Visiti~          0.481             0.228                0
##  9 Alice Sp~ Northern~ Busine~          0.526             0.224                0
## 10 Alice Sp~ Northern~ Holiday          0.377             0.827                3
## # ... with 294 more rows, and 6 more variables: seasonal_trough_year <dbl>,
## #   spikiness <dbl>, linearity <dbl>, curvature <dbl>, stl_e_acf1 <dbl>,
## #   stl_e_acf10 <dbl>

To identify which series are heavily trended and which are heavily seasonal:

tourism %>%
  features(Trips, feat_stl) %>%
  ggplot(aes(x = trend_strength,
             y = seasonal_strength_year,
             col = Purpose)) +
  geom_point() +
  facet_wrap(vars(State))

Here we can see holiday travel is, of course, most seasonal while strongest trends are in Victoria and W Australia


4.4 Other Features

Just a bullet point list of features of feasts


4.5 Exploring the previous list of bullets

All features of the feasts package can be computed like this:

tourism_features <- tourism %>%
  features(Trips, feature_set(pkgs = 'feasts'))
## Warning: `n_flat_spots()` was deprecated in feasts 0.1.5.
## Please use `longest_flat_spot()` instead.
tourism_features
## # A tibble: 304 x 51
##    Region    State     Purpose trend_strength seasonal_strengt~ seasonal_peak_y~
##    <chr>     <chr>     <chr>            <dbl>             <dbl>            <dbl>
##  1 Adelaide  South Au~ Busine~          0.451             0.380                3
##  2 Adelaide  South Au~ Holiday          0.541             0.601                1
##  3 Adelaide  South Au~ Other            0.743             0.189                2
##  4 Adelaide  South Au~ Visiti~          0.433             0.446                1
##  5 Adelaide~ South Au~ Busine~          0.453             0.140                3
##  6 Adelaide~ South Au~ Holiday          0.512             0.244                2
##  7 Adelaide~ South Au~ Other            0.584             0.374                2
##  8 Adelaide~ South Au~ Visiti~          0.481             0.228                0
##  9 Alice Sp~ Northern~ Busine~          0.526             0.224                0
## 10 Alice Sp~ Northern~ Holiday          0.377             0.827                3
## # ... with 294 more rows, and 45 more variables: seasonal_trough_year <dbl>,
## #   spikiness <dbl>, linearity <dbl>, curvature <dbl>, stl_e_acf1 <dbl>,
## #   stl_e_acf10 <dbl>, acf1 <dbl>, acf10 <dbl>, diff1_acf1 <dbl>,
## #   diff1_acf10 <dbl>, diff2_acf1 <dbl>, diff2_acf10 <dbl>, season_acf1 <dbl>,
## #   pacf5 <dbl>, diff1_pacf5 <dbl>, diff2_pacf5 <dbl>, season_pacf <dbl>,
## #   zero_run_mean <dbl>, nonzero_squared_cv <dbl>, zero_start_prop <dbl>,
## #   zero_end_prop <dbl>, lambda_guerrero <dbl>, kpss_stat <dbl>,
## #   kpss_pvalue <dbl>, pp_stat <dbl>, pp_pvalue <dbl>, ndiffs <int>,
## #   nsdiffs <int>, bp_stat <dbl>, bp_pvalue <dbl>, lb_stat <dbl>,
## #   lb_pvalue <dbl>, var_tiled_var <dbl>, var_tiled_mean <dbl>,
## #   shift_level_max <dbl>, shift_level_index <dbl>, shift_var_max <dbl>,
## #   shift_var_index <dbl>, shift_kl_max <dbl>, shift_kl_index <dbl>,
## #   spectral_entropy <dbl>, n_crossing_points <int>, longest_flat_spot <int>,
## #   coef_hurst <dbl>, stat_arch_lm <dbl>
tourism_features %>%
  select_at(vars(contains('season'), Purpose)) %>%
  mutate(
    seasonal_peak_year = seasonal_peak_year +
      4*(seasonal_peak_year==0),
    seasonal_trough_year = seasonal_trough_year +
      4*(seasonal_trough_year==0),
    seasonal_peak_year = glue('Q{seasonal_peak_year}'),
    seasonal_trough_year = glue('Q{seasonal_trough_year}')
  ) %>%
  GGally::ggpairs(mapping = aes(colour = Purpose), progress=FALSE)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Notable observations:

PCA This is where they barely introduce the subject of PCA. Better to learn the details from somewhere else.

library(broom)

pca <- tourism_features %>%
  select(-State, -Region, -Purpose) %>%
  prcomp(scale = TRUE) %>%
  augment(tourism_features)

pca %>%
  ggplot(aes(x = .fittedPC1, y=.fittedPC2, col=Purpose)) +
  geom_point() +
  theme(aspect.ratio = 1)

The above graph shows

Inspecting the anomalies:

outliers <- pca %>%
  filter(.fittedPC1 > 10) %>%
  select(Region, State, Purpose, .fittedPC1, .fittedPC2)

outliers
## # A tibble: 4 x 5
##   Region                 State             Purpose  .fittedPC1 .fittedPC2
##   <chr>                  <chr>             <chr>         <dbl>      <dbl>
## 1 Australia's North West Western Australia Business       13.4    -11.4  
## 2 Australia's South West Western Australia Holiday        10.9      0.857
## 3 Melbourne              Victoria          Holiday        12.3    -10.5  
## 4 South Coast            New South Wales   Holiday        11.9      9.40
outliers %>%
  left_join(tourism, by = c('State', 'Region', 'Purpose')) %>%
  mutate(
    Series = glue("{State}", "{Region}", "{Purpose}", .sep = "\n\n") 
  ) %>%
  ggplot(aes(x = Quarter, y = Trips)) +
  geom_line() +
  facet_grid(Series ~ ., scales = "free") +
  labs(title = "Time Series of PCA Outliers")

Spculation as to why these are outliers: