STA 313 - Visualizing time series data II

Team	TA
the_tibbles	Lorenzo
stats_fm	Lorenzo
phans_of_statistics	Lorenzo
skaz	Lorenzo
messi	Lorenzo
team_six	Lorenzo
cia	Jackie
blue_team	Jackie
pipe_it_up	Jackie
co_medians	Jackie
viz_villians	Jackie
o_ggs	Jackie
visualization_warriors	Sam
ggplot_lessthan_3	Sam
marvel_cinematic_tidyverse	Sam
rgodz	Sam

Detrending

Detrending removes the prominent long-term trend from a time series in order to highlight notable deviations from that trend
Let’s demonstrate using multiple years of AQI data

Multiple years of Durham-Chapel Hill data

dch_files <- fs::dir_ls(here::here("data/durham-chapel-hill"))
dch_files

/Users/mine/Desktop/Teaching/Duke/sta313-s23/vizdata-s23/slides/08/data/durham-chapel-hill/ad_aqi_tracker_data-2016.csv
/Users/mine/Desktop/Teaching/Duke/sta313-s23/vizdata-s23/slides/08/data/durham-chapel-hill/ad_aqi_tracker_data-2017.csv
/Users/mine/Desktop/Teaching/Duke/sta313-s23/vizdata-s23/slides/08/data/durham-chapel-hill/ad_aqi_tracker_data-2018.csv
/Users/mine/Desktop/Teaching/Duke/sta313-s23/vizdata-s23/slides/08/data/durham-chapel-hill/ad_aqi_tracker_data-2019.csv
/Users/mine/Desktop/Teaching/Duke/sta313-s23/vizdata-s23/slides/08/data/durham-chapel-hill/ad_aqi_tracker_data-2020.csv
/Users/mine/Desktop/Teaching/Duke/sta313-s23/vizdata-s23/slides/08/data/durham-chapel-hill/ad_aqi_tracker_data-2021.csv
/Users/mine/Desktop/Teaching/Duke/sta313-s23/vizdata-s23/slides/08/data/durham-chapel-hill/ad_aqi_tracker_data-2022.csv

Reading multiple files

dch <- read_csv(dch_files, na = c(".", ""))
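As a minimal sketch of what passing a vector of paths to read_csv() does, the example below writes two tiny, hypothetical CSV files to a temporary directory and reads them in a single call. The file names, column names, and values are made up for illustration. The `id = "file"` argument (available in readr 2.0 and later) is an optional extra, not used in the slides, that records which file each row came from. Base R's list.files() stands in for fs::dir_ls() to keep dependencies low.

```r
library(readr)

# Hypothetical demo files in a temporary directory (not the EPA data)
tmp <- tempfile("aqi-demo-")
dir.create(tmp)
writeLines(
  c("Date,AQI Value", "01/01/2016,32", "01/02/2016,."),
  file.path(tmp, "demo-2016.csv")
)
writeLines(
  c("Date,AQI Value", "01/01/2017,41", "01/02/2017,55"),
  file.path(tmp, "demo-2017.csv")
)

# Collect the paths; list.files() is a base R stand-in for fs::dir_ls()
demo_files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# One call reads and row-binds all files; "." and "" become NA
demo <- read_csv(
  demo_files,
  na = c(".", ""),
  id = "file",            # extra column holding the source path
  show_col_types = FALSE
)

demo  # 4 rows: two from each file
```

Because the files are stacked in the order given, sorting by date afterwards (as the pipeline that follows does with arrange()) is still a sensible safeguard.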

dch <- dch |>
  janitor::clean_names() |>
  mutate(
    date = mdy(date),
    good_aqi = if_else(aqi_value <= 50, 1, 0)
  ) |>
  filter(!is.na(aqi_value)) |>
  arrange(date) |>
  mutate(cumsum_good_aqi = cumsum(good_aqi), .after = aqi_value)

dch

# A tibble: 2,547 × 13
  date       aqi_value cumsum_go…¹ main_…² site_…³ site_id source
  <date>         <dbl>       <dbl> <chr>   <chr>   <chr>   <chr> 
1 2016-01-01        32           1 PM2.5   Durham… 37-063… AQS   
2 2016-01-02        37           2 PM2.5   Durham… 37-063… AQS   
3 2016-01-03        45           3 PM2.5   Durham… 37-063… AQS   
4 2016-01-04        33           4 PM2.5   Durham… 37-063… AQS   
5 2016-01-05        27           5 PM2.5   Durham… 37-063… AQS   
# … with 2,542 more rows, 6 more variables:
#   x20_year_high_2000_2019 <dbl>, x20_year_low_2000_2019 <dbl>,
#   x5_year_average_2015_2019 <dbl>, date_of_20_year_high <chr>,
#   date_of_20_year_low <chr>, good_aqi <dbl>, and abbreviated
#   variable names ¹cumsum_good_aqi, ²main_pollutant, ³site_name

Plot trend since 2016

dch |>
  ggplot(aes(x = date, y = cumsum_good_aqi, group = 1)) +
  geom_smooth(method = "lm", color = "pink") +
  geom_line() +
  scale_x_date(
    expand = expansion(mult = 0.07),
    date_labels = "%Y"
  ) +
  labs(
    x = NULL, y = "Number of days",
    title = "Cumulative number of good AQI days (AQI ≤ 50)",
    subtitle = "Durham-Chapel Hill, NC",
    caption = "\nSource: EPA Daily Air Quality Tracker"
  ) +
  theme(plot.title.position = "plot")

`geom_smooth()` using formula = 'y ~ x'

Detrend

Step 1. Fit a simple linear regression

m <- lm(cumsum_good_aqi ~ date, data = dch)

m


Call:
lm(formula = cumsum_good_aqi ~ date, data = dch)

Coefficients:
(Intercept)         date  
 -1.341e+04    7.954e-01
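The size of the intercept can look surprising. R stores a Date as the number of days since 1970-01-01, so lm() treats date as a day count: the slope (about 0.8 here) is in good-AQI days per calendar day, and the intercept is the line extrapolated back to 1970-01-01, far outside the data. The sketch below checks this on a hypothetical, perfectly linear series; the variable names and the 0.8 slope are made up for illustration.

```r
# Hypothetical series: 100 consecutive days starting 2016-01-01
dates <- seq(as.Date("2016-01-01"), by = "day", length.out = 100)

# A Date is stored as the number of days since 1970-01-01
day_number <- as.numeric(dates)
day_number[1]  # 16801

# Made-up response that grows by exactly 0.8 per day
y <- 0.8 * seq_along(dates)

fit <- lm(y ~ dates)
slope <- unname(coef(fit)["dates"])
intercept <- unname(coef(fit)["(Intercept)"])

# The intercept is large and negative because it refers to day 0
# (1970-01-01); evaluated on the first observed day, the line is
# back on the scale of the data
first_fitted <- intercept + slope * day_number[1]
c(slope = slope, intercept = intercept, first_fitted = first_fitted)
```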

Detrend

Step 2. Augment the data with model results (using broom::augment())

dch_aug <- augment(m)

dch_aug

# A tibble: 2,547 × 8
  cumsum_good_…¹ date       .fitted .resid    .hat .sigma .cooksd
           <dbl> <date>       <dbl>  <dbl>   <dbl>  <dbl>   <dbl>
1              1 2016-01-01   -42.8   43.8 0.00157   25.4 0.00234
2              2 2016-01-02   -42.0   44.0 0.00157   25.4 0.00236
3              3 2016-01-03   -41.3   44.3 0.00156   25.4 0.00238
4              4 2016-01-04   -40.5   44.5 0.00156   25.4 0.00240
5              5 2016-01-05   -39.7   44.7 0.00156   25.4 0.00242
# … with 2,542 more rows, 1 more variable: .std.resid <dbl>, and
#   abbreviated variable name ¹cumsum_good_aqi
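To see what the added columns mean, the sketch below runs augment() on a small model fit to R's built-in cars data. The dataset is a stand-in for dch, chosen only so the example is self-contained. It checks that .fitted matches fitted() and that .resid is the observed value minus .fitted.

```r
library(broom)

# Stand-in model on a built-in dataset (not the AQI data)
m_demo <- lm(dist ~ speed, data = cars)
demo_aug <- augment(m_demo)

# .fitted is the value of the regression line at each observation
fitted_ok <- isTRUE(all.equal(
  as.numeric(demo_aug$.fitted),
  as.numeric(fitted(m_demo))
))

# .resid is the observed response minus the fitted value
resid_ok <- isTRUE(all.equal(
  as.numeric(demo_aug$.resid),
  as.numeric(demo_aug$dist - demo_aug$.fitted)
))

c(fitted_ok = fitted_ok, resid_ok = resid_ok)
```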

Detrend

Step 3. Divide the observed value of cumsum_good_aqi by the respective value in the long-term trend (i.e., .fitted)

dch_aug <- dch_aug |>
  mutate(ratio = cumsum_good_aqi / .fitted, .after = .fitted)


dch_aug

# A tibble: 2,547 × 9
  cumsum_good_…¹ date       .fitted   ratio .resid    .hat .sigma
           <dbl> <date>       <dbl>   <dbl>  <dbl>   <dbl>  <dbl>
1              1 2016-01-01   -42.8 -0.0233   43.8 0.00157   25.4
2              2 2016-01-02   -42.0 -0.0476   44.0 0.00157   25.4
3              3 2016-01-03   -41.3 -0.0727   44.3 0.00156   25.4
4              4 2016-01-04   -40.5 -0.0989   44.5 0.00156   25.4
5              5 2016-01-05   -39.7 -0.126    44.7 0.00156   25.4
# … with 2,542 more rows, 2 more variables: .cooksd <dbl>,
#   .std.resid <dbl>, and abbreviated variable name
#   ¹cumsum_good_aqi
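Note the negative ratios in the first rows: early in 2016 the fitted line is below zero (e.g., -42.8), so dividing a positive count by it gives a negative number. A ratio is easiest to read where the trend is comfortably positive. One common alternative, not used in these slides, is to subtract the trend instead, which is the same as the residual (.resid). The sketch below compares the two on a hypothetical, deterministic series; all names and values are made up for illustration.

```r
# Hypothetical indicator: a good day every other day for 50 days,
# then a good day every day for the next 50
good <- c(rep(c(0, 1), 25), rep(1, 50))
day <- seq_along(good)
cum_good <- cumsum(good)

# Linear long-term trend, as in Step 1
fit <- lm(cum_good ~ day)
trend <- unname(fitted(fit))

# Ratio detrending (as in Step 3): values near 1 mean "on trend"
detr_ratio <- cum_good / trend

# Difference detrending: values near 0 mean "on trend";
# this is identical to residuals(fit)
detr_diff <- cum_good - trend

# The fitted line starts below zero here, so the early ratios are negative
head(data.frame(day, cum_good, trend, detr_ratio, detr_diff))
```

With a difference, the reference line in the plot would sit at 0 rather than 1.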

Visualize detrended data


dch_aug |>
  ggplot(aes(x = date, y = ratio, group = 1)) +
  geom_hline(yintercept = 1, color = "gray") +
  geom_line() +
  scale_x_date(
    expand = expansion(mult = 0.07),
    date_labels = "%Y"
  ) +
  labs(
    x = NULL, y = "Number of days\n(detrended)",
    title = "Cumulative number of good AQI days (AQI ≤ 50)",
    subtitle = "Durham-Chapel Hill, NC",
    caption = "\nSource: EPA Daily Air Quality Tracker"
  ) +
  theme(plot.title.position = "plot")

Air Quality in Durham

Barely anything interesting is happening!

Let’s look at data from somewhere with more “interesting” air quality…

Read in multiple years of SF data

sf_files <- fs::dir_ls(here::here("data/san-francisco"))

sf <- read_csv(sf_files, na = c(".", ""))

sf <- sf |>
  janitor::clean_names() |>
  mutate(
    date = mdy(date),
    good_aqi = if_else(aqi_value <= 50, 1, 0)
  ) |>
  filter(!is.na(aqi_value)) |>
  arrange(date) |>
  mutate(cumsum_good_aqi = cumsum(good_aqi), .after = aqi_value)

sf

# A tibble: 2,557 × 13
  date       aqi_value cumsum_go…¹ main_…² site_…³ site_id source
  <date>         <dbl>       <dbl> <chr>   <chr>   <chr>   <chr> 
1 2016-01-01        32           1 PM2.5   Durham… 37-063… AQS   
2 2016-01-02        37           2 PM2.5   Durham… 37-063… AQS   
3 2016-01-03        45           3 PM2.5   Durham… 37-063… AQS   
4 2016-01-04        33           4 PM2.5   Durham… 37-063… AQS   
5 2016-01-05        27           5 PM2.5   Durham… 37-063… AQS   
# … with 2,552 more rows, 6 more variables:
#   x20_year_high_2000_2019 <dbl>, x20_year_low_2000_2019 <dbl>,
#   x5_year_average_2015_2019 <dbl>, date_of_20_year_high <chr>,
#   date_of_20_year_low <chr>, good_aqi <dbl>, and abbreviated
#   variable names ¹cumsum_good_aqi, ²main_pollutant, ³site_name

Plot trend since 2016


sf |>
  ggplot(aes(x = date, y = cumsum_good_aqi, group = 1)) +
  geom_smooth(method = "lm", color = "pink") +
  geom_line() +
  scale_x_date(
    expand = expansion(mult = 0.07),
    date_labels = "%Y"
  ) +
  labs(
    x = NULL, y = "Number of days",
    title = "Cumulative number of good AQI days (AQI ≤ 50)",
    subtitle = "San Francisco-Oakland-Hayward, CA",
    caption = "\nSource: EPA Daily Air Quality Tracker"
  ) +
  theme(plot.title.position = "plot")

`geom_smooth()` using formula = 'y ~ x'

Detrend

Fit a simple linear regression

m_sf <- lm(cumsum_good_aqi ~ date, data = sf)

Augment the data with model results

sf_aug <- augment(m_sf)

Divide the observed value of cumsum_good_aqi by the respective value in the long-term trend (i.e., .fitted)

sf_aug <- sf_aug |>
  mutate(ratio = cumsum_good_aqi / .fitted, .after = .fitted)
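Since the same three steps were applied to both cities, they could be wrapped in a small helper. The function below is a hypothetical sketch, not part of the course materials: detrend_ratio() and its argument names are invented here, it uses base R's fitted() rather than broom::augment(), and it assumes there are no missing values in the two columns (as is the case after the filter() step earlier).

```r
# Hypothetical helper: fit a linear trend and return the data with
# .fitted and ratio columns appended
detrend_ratio <- function(data, y = "cumsum_good_aqi", x = "date") {
  # Build the formula y ~ x from the column names
  f <- reformulate(x, response = y)
  fit <- lm(f, data = data)

  # Long-term trend and observed / trend, as in the three steps above
  data$.fitted <- unname(fitted(fit))
  data$ratio <- data[[y]] / data$.fitted
  data
}

# Made-up toy data using the same column names as the slides
toy <- data.frame(
  date = seq(as.Date("2016-01-01"), by = "day", length.out = 10),
  cumsum_good_aqi = c(1, 2, 3, 3, 4, 5, 6, 6, 7, 8)
)

toy_detrended <- detrend_ratio(toy)
toy_detrended
```

Unlike augment(), this keeps all original columns alongside the new ones, which can be convenient for plotting.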

Visualize detrended data


sf_aug |>
  ggplot(aes(x = date, y = ratio, group = 1)) +
  geom_hline(yintercept = 1, color = "gray") +
  geom_line() +
  scale_x_date(
    expand = expansion(mult = 0.07),
    date_labels = "%Y"
  ) +
  labs(
    x = NULL, y = "Number of days\n(detrended)",
    title = "Cumulative number of good AQI days (AQI ≤ 50)",
    subtitle = "San Francisco-Oakland-Hayward, CA",
    caption = "\nSource: EPA Daily Air Quality Tracker"
  ) +
  theme(plot.title.position = "plot")

Detrending

In step 1 we fit a very simple model: a straight line
Depending on the complexity you’re trying to capture, you might choose to fit a much more complex model
You can also decompose the trend into multiple components, e.g., monthly, long-term, seasonal, etc.
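As one illustration of splitting a series into several components, the sketch below uses base R's stl() on the built-in co2 dataset (monthly atmospheric CO2 concentrations at Mauna Loa). This is a swapped-in example on different data, not the AQI analysis from the slides. stl() separates the series into seasonal, trend, and remainder parts that add back up to the original.

```r
# Built-in monthly series; stl() needs a ts object with frequency > 1
decomp <- stl(co2, s.window = "periodic")

# Multivariate ts with columns "seasonal", "trend", and "remainder"
parts <- decomp$time.series
head(parts)

# The three components sum to the original series
recombined <- parts[, "seasonal"] + parts[, "trend"] + parts[, "remainder"]

# Removing only the long-term trend leaves the seasonal pattern
# plus the leftover deviations
detrended <- co2 - parts[, "trend"]
```

Here the decomposition is additive, so detrending is a subtraction rather than the ratio used earlier.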
