Time series features

with scaled cases

Author

Raphael Saldanha

Last modification

December 1, 2023 | 09:07:18 +01:00

This notebook explores time series features of dengue cases that may guide the clustering procedures. Descriptions of the time series features are quoted from Hyndman et al. (2022).

Packages

library(tidyverse)
library(tidymodels)
library(arrow)
library(tsfeatures)
library(broom)
library(DT)
source("../functions.R")

Functions

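Perform pairwise Kolmogorov-Smirnov tests on a given statistic between groups. The function reads the tsf_group data frame, which is created at the end of this notebook.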


ks_group_test <- function(stat){
  
  tsf_group_split <- tsf_group %>%
    # Select variables and statistic
    select(group, statistic = !!stat) %>%
    # Split to list  
    group_split(group) 

  # Matrix of possible combinations
  comb <- combn(x = unique(tsf_group$group), m = 2)

  # Results data frame
  ks_results <- tibble()
  
  
  # For each group combination, perform ks.test
  for(i in 1:ncol(comb)){
    g_a <- comb[1,i]
    g_b <- comb[2,i]
    
    res <- ks.test(
      x = tsf_group_split[[g_a]]$statistic, 
      y =  tsf_group_split[[g_b]]$statistic
    ) %>% tidy()
    
    tmp <- tibble(
      g_a = g_a,
      g_b = g_b,
      statistic = round(res$statistic, 4),
      pvalue = round(res$p.value, 4)
    )
    
    ks_results <- bind_rows(ks_results, tmp)
  }
  
  ks_results %>%
    arrange(g_a, g_b)
}
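
As a minimal, self-contained sketch of the pairwise comparison performed by ks_group_test(), the snippet below runs ks.test() on every pair of groups in simulated data. The data frame toy, its three groups and their distributions are hypothetical and only stand in for tsf_group; base R is used so that the example runs without the notebook's packages.

```r
set.seed(123)

# Hypothetical stand-in for tsf_group: one statistic, three groups
toy <- data.frame(
  group = rep(c("a", "b", "c"), each = 50),
  statistic = c(rnorm(50, mean = 0), rnorm(50, mean = 0), rnorm(50, mean = 3))
)

# Split the statistic by group into a named list
by_group <- split(toy$statistic, toy$group)

# All pairs of group labels, one pair per column
pairs <- combn(names(by_group), 2)

# Two-sample KS test for each pair, collected in one data frame
toy_ks <- do.call(rbind, lapply(seq_len(ncol(pairs)), function(i) {
  res <- ks.test(by_group[[pairs[1, i]]], by_group[[pairs[2, i]]])
  data.frame(
    g_a = pairs[1, i],
    g_b = pairs[2, i],
    statistic = round(unname(res$statistic), 4),
    pvalue = round(res$p.value, 4)
  )
}))

toy_ks
```

In this toy setting, the pairs involving group "c" should show small p-values, because that group was simulated with a shifted mean.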

Load data

Load the bundled data (679 municipalities with population $\geq$ 50,000 inhabitants) with standardized cases, keeping only the municipality code, date and cases variables.

tdengue <- open_dataset(sources = data_dir("bundled_data/tdengue.parquet")) %>%
    select(mun, date, cases) %>%
    collect()

Prepare data

Convert panel data to a list of ts objects.

tdengue_df <- tdengue %>%
  arrange(mun, date) %>%
  select(-date) %>%
  nest(data = cases, .by = mun)

tdengue_list <- lapply(tdengue_df$data, ts)
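
The conversion above can be illustrated on a tiny panel. This is a sketch in base R under stated assumptions: the data frame panel, its two municipalities and its weekly dates are hypothetical, and split() is used here in place of nest().

```r
# Hypothetical panel: two municipalities, four time points each
panel <- data.frame(
  mun = rep(c("A", "B"), each = 4),
  date = rep(as.Date("2020-01-05") + 7 * 0:3, times = 2),
  cases = c(1, 3, 2, 5, 0, 0, 4, 1)
)

# Order by unit and date so each series is in chronological order
panel <- panel[order(panel$mun, panel$date), ]

# One ts object per municipality, named by municipality code
panel_list <- lapply(split(panel$cases, panel$mun), ts)

panel_list
```

Sorting before splitting matters: the date column is dropped, so the order of the rows is the only record of time in each ts object.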

Time series features

tsf <- tsfeatures(
  tslist = tdengue_list, 
  features = c("entropy", "stability",
               "lumpiness", "flat_spots",
               "zero_proportion", "stl_features",
               "acf_features")
)
  
tsf$mun <- tdengue_df$mun

The features listed above were computed with the tsfeatures package. Below are details about some of them.

Shannon entropy

Measures the “forecastability” of a time series, where low values indicate a high signal-to-noise ratio, and large values occur when a series is difficult to forecast.

\[ -\int^\pi_{-\pi}\hat{f}(\lambda)\log\hat{f}(\lambda) d\lambda \]
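
To make the formula concrete, the sketch below approximates the integral with a sum over the raw periodogram, where $\hat{f}(\lambda)$ is an estimate of the spectral density. It is an illustration only and not the estimator used by tsfeatures: the function spectral_entropy_sketch(), the normalisation to $[0, 1]$ and the two simulated series are assumptions made for this example.

```r
# Discrete approximation of spectral entropy from the raw periodogram
spectral_entropy_sketch <- function(x) {
  sp <- spec.pgram(x, taper = 0, detrend = TRUE, plot = FALSE)$spec
  p <- sp / sum(sp)   # normalise the periodogram to sum to one
  p <- p[p > 0]       # avoid log(0)
  -sum(p * log(p)) / log(length(sp))  # divide by the maximum to scale to [0, 1]
}

set.seed(1)
noise <- rnorm(512)                                        # hard to forecast
wave <- sin(2 * pi * (1:512) / 32) + rnorm(512, sd = 0.1)  # strong periodic signal

e_noise <- spectral_entropy_sketch(noise)
e_wave <- spectral_entropy_sketch(wave)

c(noise = e_noise, wave = e_wave)
```

The periodic series concentrates its spectral mass at one frequency and should therefore score lower than white noise, which matches the reading of low values as a high signal-to-noise ratio.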

ggplot(tsf, aes(x = entropy)) +
  geom_histogram(bins = 50, alpha = .7, fill = "purple") +
  theme_bw()

Stability & lumpiness

Stability and lumpiness are two time series features based on tiled (non-overlapping) windows. Means or variances are produced for all tiled windows. Then stability is the variance of the means, while lumpiness is the variance of the variances.
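
The definition above can be sketched in a few lines of base R. The function tile_stats(), the window width of 10 and the simulated series are assumptions for illustration; tsfeatures may preprocess the series differently (for example by scaling it) and choose the window width from the series frequency.

```r
# Variance of tile means (stability) and variance of tile variances (lumpiness)
tile_stats <- function(x, width = 10) {
  n_tiles <- length(x) %/% width
  # Drop the incomplete trailing tile, then store one tile per column
  tiles <- matrix(x[seq_len(n_tiles * width)], nrow = width)
  c(
    stability = var(colMeans(tiles)),
    lumpiness = var(apply(tiles, 2, var))
  )
}

set.seed(42)
flat <- rnorm(200)                                   # constant mean and variance
shifting <- rnorm(200) + rep(c(0, 5), each = 100)    # level shift halfway
bursty <- rnorm(200, sd = rep(c(1, 5), each = 100))  # variance change halfway

tile_tab <- rbind(
  flat = tile_stats(flat),
  shifting = tile_stats(shifting),
  bursty = tile_stats(bursty)
)

tile_tab
```

A level shift inflates stability, because the tile means differ, while a change in variance inflates lumpiness.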

ggplot(tsf, aes(x = stability)) +
  geom_histogram(bins = 50, alpha = .7, fill = "purple") +
  theme_bw()

ggplot(tsf, aes(x = lumpiness)) +
  geom_histogram(bins = 50, alpha = .7, fill = "purple") +
  theme_bw()

Flat spots

Flat spots are computed by dividing the sample space of a time series into ten equal-sized intervals, and computing the maximum run length within any single interval.
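
A minimal sketch of this computation, assuming that cut() with ten equal-width breaks is an acceptable way to form the intervals. The function name flat_spots_sketch() and the example vector are hypothetical.

```r
# Longest run of consecutive observations that fall in the same interval
flat_spots_sketch <- function(x) {
  # Assign each observation to one of ten equal-width intervals
  bins <- cut(x, breaks = 10, include.lowest = TRUE, labels = FALSE)
  # Run-length encode the interval ids and keep the longest run
  max(rle(bins)$lengths)
}

x <- c(1.5, 1.2, 1.1, 1.3, 9, 10, 5.5, 5.1, 0, 3.5)
flat_spots_sketch(x)
```

Here the first four values fall in the same interval, so the longest run is 4. Series that stay at a similar level for long periods, such as long stretches with few or no cases, yield large values.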

ggplot(tsf, aes(x = flat_spots)) +
  geom_histogram(bins = 50, alpha = .7, fill = "purple") +
  theme_bw()

STL decomposition features

Trend

ggplot(tsf, aes(x = trend)) +
  geom_histogram(bins = 50, alpha = .7, fill = "purple") +
  theme_bw()

Spike

ggplot(tsf, aes(x = spike)) +
  geom_histogram(bins = 50, alpha = .7, fill = "purple") +
  theme_bw()

Linearity

ggplot(tsf, aes(x = linearity)) +
  geom_histogram(bins = 50, alpha = .7, fill = "purple") +
  theme_bw()

Curvature

ggplot(tsf, aes(x = curvature)) +
  geom_histogram(bins = 50, alpha = .7, fill = "purple") +
  theme_bw()

First autocorrelation coefficient of the remainder

ggplot(tsf, aes(x = e_acf1)) +
  geom_histogram(bins = 50, alpha = .7, fill = "purple") +
  theme_bw()

Sum of the first ten squared autocorrelation coefficients of the remainder

ggplot(tsf, aes(x = e_acf10)) +
  geom_histogram(bins = 50, alpha = .7, fill = "purple") +
  theme_bw()

Autocorrelation function (ACF) features

ggplot(tsf, aes(x = x_acf1)) +
  geom_histogram(bins = 50, alpha = .7, fill = "purple") +
  theme_bw()

ggplot(tsf, aes(x = x_acf10)) +
  geom_histogram(bins = 50, alpha = .7, fill = "purple") +
  theme_bw()

ggplot(tsf, aes(x = diff1_acf1)) +
  geom_histogram(bins = 50, alpha = .7, fill = "purple") +
  theme_bw()

ggplot(tsf, aes(x = diff1_acf10)) +
  geom_histogram(bins = 50, alpha = .7, fill = "purple") +
  theme_bw()

ggplot(tsf, aes(x = diff2_acf1)) +
  geom_histogram(bins = 50, alpha = .7, fill = "purple") +
  theme_bw()

ggplot(tsf, aes(x = diff2_acf10)) +
  geom_histogram(bins = 50, alpha = .7, fill = "purple") +
  theme_bw()
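
The column names plotted above follow a pattern: the prefix identifies the series (x for the original series, diff1 and diff2 for its first and second differences) and the suffix identifies the statistic (acf1 for the first autocorrelation coefficient, acf10 for the sum of the first ten squared coefficients). The sketch below reproduces these quantities with base R on a simulated random walk; the helper acf_pair() and the simulated series are assumptions for illustration, not the tsfeatures code.

```r
# First autocorrelation and sum of the first ten squared autocorrelations
acf_pair <- function(x) {
  r <- acf(x, lag.max = 10, plot = FALSE)$acf[-1]  # drop lag 0
  c(acf1 = r[1], acf10 = sum(r^2))
}

set.seed(7)
walk <- cumsum(rnorm(300))  # random walk: strongly autocorrelated

acf_tab <- rbind(
  x = acf_pair(walk),                            # original series
  diff1 = acf_pair(diff(walk)),                  # first differences
  diff2 = acf_pair(diff(walk, differences = 2))  # second differences
)

acf_tab
```

For a random walk the original series shows a first autocorrelation close to one, while the second differences are over-differenced and show a negative first autocorrelation.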

Clustering

The goal of this procedure is to cluster the municipalities according to the similarity of their time series features.

K-means clustering

Cluster the municipalities based solely on the time series features.

points <- tsf %>%
  select(-mun)

Use values of $k$ from 2 to 10 for clustering.

kclusts <- 
  tibble(k = 2:10) %>%
  mutate(
    kclust = map(k, ~kmeans(points, .x)),
    tidied = map(kclust, tidy),
    glanced = map(kclust, glance),
    augmented = map(kclust, augment, points)
  )

Isolate results.

clusters <- 
  kclusts %>%
  unnest(cols = c(tidied))

assignments <- 
  kclusts %>% 
  unnest(cols = c(augmented))

clusterings <- 
  kclusts %>%
  unnest(cols = c(glanced))

The total within-cluster sum of squares is plotted against $k$. The value $k=5$ seems to be a break point.

ggplot(clusterings, aes(k, tot.withinss)) +
  geom_line() +
  geom_point() +
  theme_bw()
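
How such a break point shows up can be seen on data with a known number of groups. The sketch below is an illustration under stated assumptions: the matrix toy_points, with three simulated groups, is hypothetical, and set.seed() together with nstart = 25 is used so that the result does not depend on a single random initialisation of kmeans().

```r
set.seed(2023)

# Hypothetical feature matrix: three well-separated groups in two dimensions
toy_points <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2),
  matrix(rnorm(100, mean = 12), ncol = 2)
)

# Total within-cluster sum of squares for k = 1, ..., 6
ks <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(toy_points, centers = k, nstart = 25)$tot.withinss
})

# Reduction gained by each additional cluster
elbow <- data.frame(
  k = ks,
  tot_withinss = round(wss, 1),
  reduction = round(c(NA, -diff(wss)), 1)
)

elbow
```

The reduction should be large up to $k = 3$ and small afterwards, which is the kind of break point read from the plot above.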

silhouette_score <- function(k){
  km <- kmeans(points, centers = k, nstart=25)
  ss <- cluster::silhouette(km$cluster, dist(points))
  mean(ss[, 3])
}
k <- 2:10
avg_sil <- sapply(k, silhouette_score)
plot(k, avg_sil, type = 'b', xlab = 'Number of clusters', ylab = 'Average Silhouette Scores', frame = FALSE)
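
The average silhouette width used above compares, for each observation, the mean distance to its own cluster ($a$) with the smallest mean distance to any other cluster ($b$), as $(b - a) / \max(a, b)$. The sketch below computes it by hand in base R so that the quantity is explicit; the helper avg_silhouette() and the matrix two_blobs are hypothetical, and the notebook itself relies on cluster::silhouette().

```r
# Average silhouette width computed from a distance matrix
avg_silhouette <- function(points, cluster) {
  d <- as.matrix(dist(points))
  widths <- sapply(seq_len(nrow(d)), function(i) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) return(0)  # convention for singleton clusters
    # a: mean distance to the other members of the own cluster
    a <- sum(d[i, own]) / (sum(own) - 1)
    # b: smallest mean distance to the members of any other cluster
    b <- min(tapply(d[i, !own], cluster[!own], mean))
    (b - a) / max(a, b)
  })
  mean(widths)
}

set.seed(99)
# Hypothetical data: two well-separated groups in two dimensions
two_blobs <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 8), ncol = 2)
)

sil <- sapply(2:5, function(k) {
  km <- kmeans(two_blobs, centers = k, nstart = 25)
  avg_silhouette(two_blobs, km$cluster)
})
names(sil) <- 2:5

sil
```

Values close to one indicate compact, well-separated clusters, so the preferred $k$ is the one with the highest average; in this toy example that should be $k = 2$.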

Identify municipalities and cluster id

Finally, the cluster partition ID is added to the main dataset.

cluster_ids <- clusterings %>%
  filter(k == 5) %>%
  pull(augmented) %>%
  pluck(1) %>%
  select(group = .cluster) %>%
  mutate(mun = tdengue_df$mun)

Cluster sizes

table(cluster_ids$group)


  1   2   3   4   5 
 37 225  56 259 102

Cluster time series plot

inner_join(tdengue, cluster_ids, by = "mun") %>%
  ggplot(aes(x = date, y = cases, color = mun)) +
  geom_line(alpha = .3) +
  facet_wrap(~group) +
  theme_bw() +
  theme(legend.position = "none")

Time series features per group

Add the group ID to the time series features.

tsf_group <- left_join(tsf, cluster_ids, by = "mun")