Chapter 04 Notes

4.1 Simple Stats

How to calculate the mean:

tourism %>%
  features(Trips, list(mean = mean)) %>%
  arrange(mean)

## # A tibble: 304 x 4
##    Region          State              Purpose   mean
##    <chr>           <chr>              <chr>    <dbl>
##  1 Kangaroo Island South Australia    Other    0.340
##  2 MacDonnell      Northern Territory Other    0.449
##  3 Wilderness West Tasmania           Other    0.478
##  4 Barkly          Northern Territory Other    0.632
##  5 Clare Valley    South Australia    Other    0.898
##  6 Barossa         South Australia    Other    1.02 
##  7 Kakadu Arnhem   Northern Territory Other    1.04 
##  8 Lasseter        Northern Territory Other    1.14 
##  9 Wimmera         Victoria           Other    1.15 
## 10 MacDonnell      Northern Territory Visiting 1.18 
## # ... with 294 more rows

How to calculate quantiles:

tourism %>% features(Trips, quantile)

## # A tibble: 304 x 8
##    Region         State             Purpose    `0%`  `25%`   `50%`  `75%` `100%`
##    <chr>          <chr>             <chr>     <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
##  1 Adelaide       South Australia   Busine~  68.7   134.   153.    177.   242.  
##  2 Adelaide       South Australia   Holiday 108.    135.   154.    172.   224.  
##  3 Adelaide       South Australia   Other    25.9    43.9   53.8    62.5  107.  
##  4 Adelaide       South Australia   Visiti~ 137.    179.   206.    229.   270.  
##  5 Adelaide Hills South Australia   Busine~   0       0      1.26    3.92  28.6 
##  6 Adelaide Hills South Australia   Holiday   0       5.77   8.52   14.1   35.8 
##  7 Adelaide Hills South Australia   Other     0       0      0.908   2.09   8.95
##  8 Adelaide Hills South Australia   Visiti~   0.778   8.91  12.2    16.8   81.1 
##  9 Alice Springs  Northern Territo~ Busine~   1.01    9.13  13.3    18.5   34.1 
## 10 Alice Springs  Northern Territo~ Holiday   2.81   16.9   31.5    44.8   76.5 
## # ... with 294 more rows

For this the min is labeled 0% and max is labeled 100%. Weird.

4.2 ACF Features

Sum of the first ten squared autocorrelation coefficients is use as a summary for how much autocorrelation there is in a series
You can also use autocorrelation on differenced data and seasonally differenced data (more in future chapters)

feat_acf() computes some of the autocorrelations above

tourism %>% features(Trips, feat_acf)

## # A tibble: 304 x 10
##    Region    State      Purpose     acf1 acf10 diff1_acf1 diff1_acf10 diff2_acf1
##    <chr>     <chr>      <chr>      <dbl> <dbl>      <dbl>       <dbl>      <dbl>
##  1 Adelaide  South Aus~ Busine~  0.0333  0.131     -0.520       0.463     -0.676
##  2 Adelaide  South Aus~ Holiday  0.0456  0.372     -0.343       0.614     -0.487
##  3 Adelaide  South Aus~ Other    0.517   1.15      -0.409       0.383     -0.675
##  4 Adelaide  South Aus~ Visiti~  0.0684  0.294     -0.394       0.452     -0.518
##  5 Adelaide~ South Aus~ Busine~  0.0709  0.134     -0.580       0.415     -0.750
##  6 Adelaide~ South Aus~ Holiday  0.131   0.313     -0.536       0.500     -0.716
##  7 Adelaide~ South Aus~ Other    0.261   0.330     -0.253       0.317     -0.457
##  8 Adelaide~ South Aus~ Visiti~  0.139   0.117     -0.472       0.239     -0.626
##  9 Alice Sp~ Northern ~ Busine~  0.217   0.367     -0.500       0.381     -0.658
## 10 Alice Sp~ Northern ~ Holiday -0.00660 2.11      -0.153       2.11      -0.274
## # ... with 294 more rows, and 2 more variables: diff2_acf10 <dbl>,
## #   season_acf1 <dbl>

4.3 STL Features

For strongly trended data, seasonally adjusted should have much more variation than the remainder. For data with little to no trend, variances should be about the same.
Calculate trend strength as follows:

\[F_T = max\left(0,1 - \frac{Var(R_t)}{Var(T_t + R_t)} \right)\]

calculate seasonal strength as follows:

\[F_S = max\left(0,1 - \frac{Var(S_t)}{Var(T_t + R_t)} \right)\]

A series with strong trend or strong seasonality would be close to 1 where a series with no seasonality or trend would be close to 0.

tourism %>%
  features(Trips, feat_stl)

## # A tibble: 304 x 12
##    Region    State     Purpose trend_strength seasonal_strengt~ seasonal_peak_y~
##    <chr>     <chr>     <chr>            <dbl>             <dbl>            <dbl>
##  1 Adelaide  South Au~ Busine~          0.451             0.380                3
##  2 Adelaide  South Au~ Holiday          0.541             0.601                1
##  3 Adelaide  South Au~ Other            0.743             0.189                2
##  4 Adelaide  South Au~ Visiti~          0.433             0.446                1
##  5 Adelaide~ South Au~ Busine~          0.453             0.140                3
##  6 Adelaide~ South Au~ Holiday          0.512             0.244                2
##  7 Adelaide~ South Au~ Other            0.584             0.374                2
##  8 Adelaide~ South Au~ Visiti~          0.481             0.228                0
##  9 Alice Sp~ Northern~ Busine~          0.526             0.224                0
## 10 Alice Sp~ Northern~ Holiday          0.377             0.827                3
## # ... with 294 more rows, and 6 more variables: seasonal_trough_year <dbl>,
## #   spikiness <dbl>, linearity <dbl>, curvature <dbl>, stl_e_acf1 <dbl>,
## #   stl_e_acf10 <dbl>

To identify which series are heavily trended and which are heavily seasonal:

tourism %>%
  features(Trips, feat_stl) %>%
  ggplot(aes(x = trend_strength,
             y = seasonal_strength_year,
             col = Purpose)) +
  geom_point() +
  facet_wrap(vars(State))

Here we can see holiday travel is, of course, most seasonal while strongest trends are in Victoria and W Australia

4.4 Other Features

Just a bullet point list of features of feasts

4.5 Exploring the previous list of bullets

All features of the feasts package can be computed like this:

tourism_features <- tourism %>%
  features(Trips, feature_set(pkgs = 'feasts'))

## Warning: `n_flat_spots()` was deprecated in feasts 0.1.5.
## Please use `longest_flat_spot()` instead.

tourism_features

## # A tibble: 304 x 51
##    Region    State     Purpose trend_strength seasonal_strengt~ seasonal_peak_y~
##    <chr>     <chr>     <chr>            <dbl>             <dbl>            <dbl>
##  1 Adelaide  South Au~ Busine~          0.451             0.380                3
##  2 Adelaide  South Au~ Holiday          0.541             0.601                1
##  3 Adelaide  South Au~ Other            0.743             0.189                2
##  4 Adelaide  South Au~ Visiti~          0.433             0.446                1
##  5 Adelaide~ South Au~ Busine~          0.453             0.140                3
##  6 Adelaide~ South Au~ Holiday          0.512             0.244                2
##  7 Adelaide~ South Au~ Other            0.584             0.374                2
##  8 Adelaide~ South Au~ Visiti~          0.481             0.228                0
##  9 Alice Sp~ Northern~ Busine~          0.526             0.224                0
## 10 Alice Sp~ Northern~ Holiday          0.377             0.827                3
## # ... with 294 more rows, and 45 more variables: seasonal_trough_year <dbl>,
## #   spikiness <dbl>, linearity <dbl>, curvature <dbl>, stl_e_acf1 <dbl>,
## #   stl_e_acf10 <dbl>, acf1 <dbl>, acf10 <dbl>, diff1_acf1 <dbl>,
## #   diff1_acf10 <dbl>, diff2_acf1 <dbl>, diff2_acf10 <dbl>, season_acf1 <dbl>,
## #   pacf5 <dbl>, diff1_pacf5 <dbl>, diff2_pacf5 <dbl>, season_pacf <dbl>,
## #   zero_run_mean <dbl>, nonzero_squared_cv <dbl>, zero_start_prop <dbl>,
## #   zero_end_prop <dbl>, lambda_guerrero <dbl>, kpss_stat <dbl>,
## #   kpss_pvalue <dbl>, pp_stat <dbl>, pp_pvalue <dbl>, ndiffs <int>,
## #   nsdiffs <int>, bp_stat <dbl>, bp_pvalue <dbl>, lb_stat <dbl>,
## #   lb_pvalue <dbl>, var_tiled_var <dbl>, var_tiled_mean <dbl>,
## #   shift_level_max <dbl>, shift_level_index <dbl>, shift_var_max <dbl>,
## #   shift_var_index <dbl>, shift_kl_max <dbl>, shift_kl_index <dbl>,
## #   spectral_entropy <dbl>, n_crossing_points <int>, longest_flat_spot <int>,
## #   coef_hurst <dbl>, stat_arch_lm <dbl>

tourism_features %>%
  select_at(vars(contains('season'), Purpose)) %>%
  mutate(
    seasonal_peak_year = seasonal_peak_year +
      4*(seasonal_peak_year==0),
    seasonal_trough_year = seasonal_trough_year +
      4*(seasonal_trough_year==0),
    seasonal_peak_year = glue('Q{seasonal_peak_year}'),
    seasonal_trough_year = glue('Q{seasonal_trough_year}')
  ) %>%
  GGally::ggpairs(mapping = aes(colour = Purpose), progress=FALSE)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Notable observations:

The 3 numerical measures related to seasonality are all positively correlated
The bottom left panel and top right panel both show that the most strongly seasonal series is Holiday
Bar plots in the bottom row of seasonal_peak_year and seasonal_trough_year columns show that seasonal peaks in Business travel occur most often in Q3 and least often in Q1

PCA This is where they barely introduce the subject of PCA. Better to learn the details from somewhere else.

library(broom)

pca <- tourism_features %>%
  select(-State, -Region, -Purpose) %>%
  prcomp(scale = TRUE) %>%
  augment(tourism_features)

pca %>%
  ggplot(aes(x = .fittedPC1, y=.fittedPC2, col=Purpose)) +
  geom_point() +
  theme(aspect.ratio = 1)

The above graph shows

The holiday series behaves differently from the others
- The second principle component is distinguishing between holiday and other
There are also 4 anomaly points > 10 on the x-axis

Inspecting the anomalies:

outliers <- pca %>%
  filter(.fittedPC1 > 10) %>%
  select(Region, State, Purpose, .fittedPC1, .fittedPC2)

outliers

## # A tibble: 4 x 5
##   Region                 State             Purpose  .fittedPC1 .fittedPC2
##   <chr>                  <chr>             <chr>         <dbl>      <dbl>
## 1 Australia's North West Western Australia Business       13.4    -11.4  
## 2 Australia's South West Western Australia Holiday        10.9      0.857
## 3 Melbourne              Victoria          Holiday        12.3    -10.5  
## 4 South Coast            New South Wales   Holiday        11.9      9.40

outliers %>%
  left_join(tourism, by = c('State', 'Region', 'Purpose')) %>%
  mutate(
    Series = glue("{State}", "{Region}", "{Purpose}", .sep = "\n\n") 
  ) %>%
  ggplot(aes(x = Quarter, y = Trips)) +
  geom_line() +
  facet_grid(Series ~ ., scales = "free") +
  labs(title = "Time Series of PCA Outliers")

Spculation as to why these are outliers:

NSW South Coast is highly seasonal but shows no trend whereas most holiday destinations show a trend over time
While Melbourne has a trend as a holiday destination, it has almost no seasonality
North West Corner of Western Australia shows an increase in business tourism over the past couple years but no apparent seasonality
The SW corner of Western Australia is unusual as it shows an increase in holiday tourism in just the past couple of years and a high level of seasonality