Multilingual Time Series EDA

Analyzing my household energy usage

Author

Santiago Rodriguez

Published

October 11, 2024

Modified

November 6, 2024

TL;DR

A code-friendly tutorial about exploratory data analysis (EDA) using my household energy usage. The article explores plots and technical topics such as stationarity.

Outcomes:

  • the log transformation is useful
  • differencing d = 365 helps remove the annual seasonality in the data

About

This article is part of a larger series. The primary component underpinning the rest is a data engineering pipeline that preps the data for analysis. The culmination of this work is a forecasting API hosted on my household server. The insights from this EDA will guide the forecasting workflow.

Finally, a note about code. As a multilingual programmer (Python, R, and Julia), I will display Python and R output. For brevity, I’ve chosen not to show the code itself. However, the code base is publicly available here.

My household energy usage

Without further ado. I present my household energy usage data.

The start date of the time series is 2021-01-16 and the latest date is 2024-11-06, and there are 1,367 days in the series.

To create the plots I used the ggplot2 package in R and the plotnine package in Python. Why plotnine? For a few reasons. First, it’s a replica of ggplot, which means I don’t have to learn new syntax. Second, I can without many adjustments copy and paste the R code. Efficiency is important. As is avoiding repetition - I learned the DRY principle from The Pragmatic Programmer (Hunt and Thomas 1999).

Let me show you what I mean. Below are the import statements for ggplot and plotnine.

```{r}
box::use(
  ggplot2[aes, ggplot, geom_line, geom_point],
)
```
```{python}
from plotnine import aes, ggplot, geom_line, geom_point
```

See? They’re the same/ similar. The only difference is that the R code is imported via the box package. What is box? That is a conversation for a different time. Suffice to say that box is a modern way of importing packages and namespaces in R. Using box makes transitioning between languages breezy.

Unhide
(
  ggplot(r.df_energy, aes(x='date', y='kwh')) +
  geom_line(color='navy', group=1) +
  labs(
      title="Household Energy usage",
      x="Date",
      y="Kwh"
  ) +
  theme_minimal(base_size = 15) +
  theme(
      panel_grid = element_blank(),
      plot_title=element_text(size=14)
  ) +
  scale_x_datetime(
    breaks=date_breaks("1 years"),
    labels=date_format("%Y")
  )
) \
.show()

Unhide
ggplot(
  data = df_energy,
  aes(x = date, y = kwh)
) +
  geom_line(color = "navy") +
  labs(
    title = "Household Energy usage",
    x = "Date",
    y = "Kwh"
  ) +
  theme_minimal(base_size = 15) +
  theme(
    panel.grid = element_blank(),
    plot.title = element_text(size = 14)
  )

Time series EDA

I am performing EDA on my household energy data because I want to create a forecast. Before jumping into forecasting work though a better understanding of the data is needed. This is especially true when working with statistical models.

In the time series class in grad school we used Time Series, A Data Analysis Approach Using R (Robert H. Shumway 2019). I’ll use this book as the guide for performing EDA. It’s worth noting that there are many great resources for time series, such as anything by Rob Hyndman.

Our focus is to assess what transformations are needed to achieve stationarity.

definition of stationarity

Stationarity is particularly important for statistical models. In the following sections we’ll perform a procedure and assess if the data is stationary via a QQ-plot. I find plots more useful than using a hypothesis test.

Unhide
qqplot(r.df_energy["kwh"], line="q")

Unhide
qqnorm(df_energy[["kwh"]])
qqline(df_energy[["kwh"]])

Based on the above plots there’s no need to formally test anything. It’s obvious the data isn’t normally distributed. The data has lighter tails than the \(N()\) on the right hand side and heavier tails than the \(N()\) on the left hand side.

I’ll also now introduce the ACF and PACF plots. From Time Series (Robert H. Shumway 2019):

AR\((p)\) MA\((q)\) ARMA\((p, q)\)
ACF Tails off Cuts off after lag \(q\) Tails off
PACF Cuts off after lag \(p\) Tails off Tails off

Using the table above we’ll visually analyze the plots below of the raw data to determine if the data has an AR, MA, or ARMA specification.

Unhide
# Plot ACF and PACF side by side
fig, axes = plt.subplots(2, 1)

# ACF plot
_ = plot_acf(r.df_energy['kwh'], ax=axes[0], lags=365, zero=False)
_ = axes[0].set_title("ACF Plot")

# PACF plot
_ = plot_pacf(r.df_energy['kwh'], ax=axes[1], lags=365, method='ywm', zero=False)
_ = axes[1].set_title("PACF Plot")

plt.tight_layout()
plt.show()

Unhide
tmp <- acf2(
  ts_data,
  max.lag = 365,
  main = "Energy Usage"
)

Based on a 1-year view of the data it almost appears if the raw data could be specified by an AR(1) model.

Unhide
# Plot ACF and PACF side by side
fig, axes = plt.subplots(2, 1)

# ACF plot
_ = plot_acf(r.df_energy['kwh'], ax=axes[0], lags=31, zero=False)
_ = axes[0].set_title("ACF Plot")

# PACF plot
_ = plot_pacf(r.df_energy['kwh'], ax=axes[1], lags=31, method='ywm', zero=False)
_ = axes[1].set_title("PACF Plot")

plt.tight_layout()
plt.show()

Unhide
tmp <- acf2(ts_data, max.lag = 31, main = "Energy Usage")

The takeaway is the same looking at lags 1:31, AR(1).

Decomposition

First, is a traditional approach to time series EDA. This approach deconstructs the time series into separate elements.

Unhide
# STL decomposition with a periodic seasonal component
stl = STL(
  r.df_energy[['kwh']],
  seasonal=365,
  period = 365
)
result = stl.fit()

# Plotting the results
result.plot()
plt.show()

Unhide
plot(stl(ts_data, s.window = "periodic"))

Transformations

From school and experience, I’ve found that the log transformation is worth exploring. This is because the log flattens out perturbations. As can be seen in the plot of my household energy usage, there are quite a few large spikes over time.

Unhide
# Create the plot
(
  ggplot(r.df_energy, aes(x='date', y='kwh_log')) +
  geom_line(color='navy', group=1) +
  labs(
      title="Logged Energy Usage",
      x="Date",
      y="Energy Usage"
  ) +
  geom_hline(yintercept=0, linetype='dashed') +
  theme_minimal(base_size = 15) +
  theme(
      panel_grid = element_blank(),
      plot_title=element_text(size=14)
  ) +
  scale_x_datetime(
    breaks=date_breaks("1 years"),
    labels=date_format("%Y")
  )
) \
.show()

Unhide
ggplot(
  data = df_energy,
  aes(x = date, y = kwh_log)
) +
  geom_line(color = "navy") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "Loggged Energy Usage",
    x = "Date",
    y = "Energy Usage"
  ) +
  theme_minimal(base_size = 15) +
  theme(
    panel.grid = element_blank(),
    plot.title = element_text(size = 14)
  )

Now, to view the QQ-plot of the log transformed data. If the transformation had a positive effect, the data should be more normal.

Unhide
qqplot(r.df_energy["kwh_log"], line="q")

Unhide
qqnorm(df_energy[["kwh_log"]])
qqline(df_energy[["kwh_log"]])

The results of applying the log transformation to the data looks good. The QQ-plot looks much better. There’s still some deviation around \(\pm 2\). Perhaps something else is needed. For now, though, the subsequent analyses will use the log-transformed data.

And as before, the ACF and PACF plots.

Unhide
# Plot ACF and PACF side by side
fig, axes = plt.subplots(2, 1)

# ACF plot
_ = plot_acf(r.df_energy['kwh_log'], ax=axes[0], lags=365, zero=False)
_ = axes[0].set_title("ACF Plot")

# PACF plot
_ = plot_pacf(r.df_energy['kwh_log'], ax=axes[1], lags=365, method='ywm', zero=False)
_ = axes[1].set_title("PACF Plot")

plt.tight_layout()
plt.show()

Unhide
tmp <- acf2(log(ts_data), max.lag = 365, main = "Log Energy Usage")

Unhide
# Plot ACF and PACF side by side
fig, axes = plt.subplots(2, 1)

# ACF plot
_ = plot_acf(r.df_energy['kwh_log'], ax=axes[0], lags=31, zero=False)
_ = axes[0].set_title("ACF Plot")

# PACF plot
_ = plot_pacf(r.df_energy['kwh_log'], ax=axes[1], lags=31, method='ywm', zero=False)
_ = axes[1].set_title("PACF Plot")

plt.tight_layout()
plt.show()

Unhide
tmp <- acf2(log(ts_data), max.lag = 31, main = "Log Energy Usage")

The log transformation hasn’t changed the earlier interpretation of the ACF and PACF plots. The data still resembles an AR(1) process.

Lag plots

So the log transformation helped the data look more normal. Next, another common task in time series EDA, autocorrelation. There are specific methods to assess this, but first a visual building block: lag plots.

The period or frequency of the data is daily (i.e., 365). It would be too cumbersome to plot 365 lags. Instead for illustrative purposes, let’s keep the max lag \(\leq 12\). I chose 12 arbitrarily.

Unhide
max_lag = 12
fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(15, 10))

for lag, ax in zip(range(1, max_lag + 1), axes.flatten()):
    lag_plot(r.df_energy['kwh_log'], lag=lag, ax=ax)
    ax.set_title(f'Lag {lag}')
    
    correlation = r.df_energy['kwh_log'].autocorr(lag)
    
    ax.text(
      0.95, 0.05,
      f"Corr: {correlation:.2f}", 
      transform=ax.transAxes, 
      ha="right", va="bottom", 
      fontsize=10, 
      bbox=dict(facecolor='white', alpha=0.5)
    )

plt.tight_layout()
plt.show()

Unhide
lag1.plot(log(ts_data), max.lag = 12)

The data has a strong correlation (\(\rho \gt 0.5\)) through lag 4. Interpretation: at any given day, my household’s energy usage is related to the prior four days energy usage.

So, what? Serial correlation or auto correlation violates an assumption of linear regression. And since many stats models are regression based, that’s a problem.

Differencing

Differncing is a useful technique to address auto correlation.

It’s useful to think about the data before blindly doing things. This is especially true for differencing, there are risks to over-differencing.

Given what I know about the data and based on the first plot of the data, it seems useful to test for day-over-day correlations and annual seasonality. Any given day’s enery usage \(t_i\) is likely similar to that of the day before \(t_{i-1}\) - from the lag plot the correlation at lag 1 is \(\approx 0.8\). While the lag plot only displayed lags 1 through 12, common sense tells us that the weather or climate impacts energy usage. We know that summers tend to be similar. So, the climate in July 2024 is probably similar to the climate of prior years.

This is differencing notation \(\bigtriangledown^d\), \(d\) denotes the order.

When:

  • \(d\) = 1, removes a linear trend
  • \(d\) = 2, removes a quadtratic trend

It’s possible to chain operations. For example, first log transform the data then difference. Transformations were discussed earlier because they are often applied before differencing.

Day-over-day

Unhide
# Create the plot
(
  ggplot(
    tmp,
    aes(
      x='date',
      y='y'
    )
  ) +
  geom_line(color='navy', group=1) +
  labs(
      title="Loggged and Differenced Energy Usage",
      subtitle="d1",
      x="Date",
      y="Energy Usage"
  ) +
  geom_hline(yintercept=0, linetype='dashed') +
  theme_minimal(base_size = 15) +
  theme(
      panel_grid = element_blank(),
      plot_title=element_text(size=14)
  ) +
  scale_x_datetime(
    breaks=date_breaks("1 years"),
    labels=date_format("%Y")
  )
) \
.show()

Unhide
data.frame(
  x = time(diff(log(ts_data))),
  y = diff(log(ts_data))
) |>
  ggplot(
    data = _,
    aes(x = x, y = y)
  ) +
  geom_line(color = "navy") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "Loggged and Differenced Energy Usage",
    subtitle = "d1",
    x = "Date",
    y = "Energy Usage"
  ) +
  theme_minimal(base_size = 15) +
  theme(
    panel.grid = element_blank(),
    plot.title = element_text(size = 14)
  )

Unhide
# Plot ACF and PACF side by side
fig, axes = plt.subplots(2, 1)

# ACF plot
_ = plot_acf(r.df_energy['kwh_log'].diff().dropna(), ax=axes[0], lags=365, zero=False)
_ = axes[0].set_title("ACF Plot")

# PACF plot
_ = plot_pacf(r.df_energy['kwh_log'].diff().dropna(), ax=axes[1], lags=365, method='ywm', zero=False)
_ = axes[1].set_title("PACF Plot")

plt.tight_layout()
plt.show()

Unhide
tmp <- acf2(
  diff(log(ts_data)),
  max.lag = 365,
  main = "Difference 1 Log Energy Usage"
)

Unhide
# Plot ACF and PACF side by side
fig, axes = plt.subplots(2, 1)

# ACF plot
_ = plot_acf(r.df_energy['kwh_log'].diff().dropna(), ax=axes[0], lags=31, zero=False)
_ = axes[0].set_title("ACF Plot")

# PACF plot
_ = plot_pacf(r.df_energy['kwh_log'].diff().dropna(), ax=axes[1], lags=31, method='ywm', zero=False)
_ = axes[1].set_title("PACF Plot")

plt.tight_layout()
plt.show()

Unhide
tmp <- acf2(
  diff(log(ts_data)),
  max.lag = 31,
  main = "Difference 1 Log Energy Usage"
)

Unhide
qqplot(r.df_energy["kwh_log"].diff().dropna(), line="q")

Unhide
qqnorm(diff(df_energy[["kwh_log"]]))
qqline(diff(df_energy[["kwh_log"]]))

So, differencing the log transformed data resulted in a worse QQ plot and a less obvious interpretation of the ACF and PACF plots, but it sort of looks like an MA(1) might fit.

Annual seasonality

Unhide
(
  ggplot(
    tmp,
    aes(
      x='date',
      y='y'
    )
  ) +
  geom_line(color='navy', group=1) +
  labs(
      title="Loggged and Differenced (Annually) Energy Usage",
      subtitle="D1",
      x="Date",
      y="Energy Usage"
  ) +
  geom_hline(yintercept=0, linetype='dashed') +
  theme_minimal(base_size = 15) +
  theme(
      panel_grid = element_blank(),
      plot_title=element_text(size=14)
  ) +
  scale_x_datetime(
    breaks=date_breaks("1 years"),
    labels=date_format("%Y")
  )
) \
.show()

Unhide
data.frame(
  x = time(diff(log(ts_data), 365)),
  y = diff(log(ts_data), 365)
) |>
  ggplot(
    data = _,
    aes(x = x, y = y)
  ) +
  geom_line(color = "navy") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "Loggged and Differenced (Annually) Energy Usage",
    subtitle = "D1",
    x = "Date",
    y = "Energy Usage"
  ) +
  theme_minimal(base_size = 15) +
  theme(
    panel.grid = element_blank(),
    plot.title = element_text(size = 14)
  )

Unhide
# Plot ACF and PACF side by side
fig, axes = plt.subplots(2, 1)

# ACF plot
_ = plot_acf(r.df_energy['kwh_log'].diff(365).dropna(), ax=axes[0], lags=365, zero=False)
_ = axes[0].set_title("ACF Plot")

# PACF plot
_ = plot_pacf(r.df_energy['kwh_log'].diff(365).dropna(), ax=axes[1], lags=365, method='ywm', zero=False)
_ = axes[1].set_title("PACF Plot")

plt.tight_layout()
plt.show()

Unhide
tmp <- acf2(
  diff(log(ts_data), 365),
  max.lag = 365,
  main = "Difference 365 Log Energy Usage"
)

Unhide
# Plot ACF and PACF side by side
fig, axes = plt.subplots(2, 1)

# ACF plot
_ = plot_acf(r.df_energy['kwh_log'].diff(365).dropna(), ax=axes[0], lags=31, zero=False)
_ = axes[0].set_title("ACF Plot")

# PACF plot
_ = plot_pacf(r.df_energy['kwh_log'].diff(365).dropna(), ax=axes[1], lags=31, method='ywm', zero=False)
_ = axes[1].set_title("PACF Plot")

plt.tight_layout()
plt.show()

Unhide
tmp <- acf2(
  diff(log(ts_data), 365),
  max.lag = 31,
  main = "Difference 365 Log Energy Usage"
)

Unhide
qqplot(r.df_energy["kwh_log"].diff(365).dropna(), line="q")

Unhide
qqnorm(diff(df_energy[["kwh_log"]], 365))
qqline(diff(df_energy[["kwh_log"]], 365))

So, differencing the log transformed data resulted in a worse QQ plot than the logged data but a better QQ plot than the differenced order 1 plot. It sort of looks like an AR(1) fits.

Combined differences

Unhide
(
  ggplot(
    tmp,
    aes(
      x='date',
      y='y'
    )
  ) +
  geom_line(color='navy', group=1) +
  labs(
      title="Loggged and Differenced Energy Usage",
      subtitle="d1 D1",
      x="Date",
      y="Energy Usage"
  ) +
  geom_hline(yintercept=0, linetype='dashed') +
  theme_minimal(base_size = 15) +
  theme(
      panel_grid = element_blank(),
      plot_title=element_text(size=14)
  ) +
  scale_x_datetime(
    breaks=date_breaks("1 years"),
    labels=date_format("%Y")
  )
) \
.show()

Unhide
data.frame(
  x = time(diff(diff(log(ts_data)), 365)),
  y = diff(diff(log(ts_data)), 365)
) |>
  ggplot(
    data = _,
    aes(x = x, y = y)
  ) +
  geom_line(color = "navy") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "Loggged and Differenced Energy Usage",
    subtitle = "d1 D1",
    x = "Date",
    y = "Energy Usage"
  ) +
  theme_minimal(base_size = 15) +
  theme(
    panel.grid = element_blank(),
    plot.title = element_text(size = 14)
  )

Unhide
# Plot ACF and PACF side by side
fig, axes = plt.subplots(2, 1)

# ACF plot
_ = plot_acf(r.df_energy['kwh_log'].diff().diff(365).dropna(), ax=axes[0], lags=365, zero=False)
_ = axes[0].set_title("ACF Plot")

# PACF plot
_ = plot_pacf(r.df_energy['kwh_log'].diff().diff(365).dropna(), ax=axes[1], lags=365, method='ywm', zero=False)
_ = axes[1].set_title("PACF Plot")

plt.tight_layout()
plt.show()

Unhide
tmp <- acf2(
  diff(diff(log(ts_data)), 365),
  max.lag = 365,
  main = "Difference 365 Log Energy Usage"
)

Unhide
# Plot ACF and PACF side by side
fig, axes = plt.subplots(2, 1)

# ACF plot
_ = plot_acf(r.df_energy['kwh_log'].diff().diff(365).dropna(), ax=axes[0], lags=31, zero=False)
_ = axes[0].set_title("ACF Plot")

# PACF plot
_ = plot_pacf(r.df_energy['kwh_log'].diff().diff(365).dropna(), ax=axes[1], lags=31, method='ywm', zero=False)
_ = axes[1].set_title("PACF Plot")

plt.tight_layout()
plt.show()

Unhide
tmp <- acf2(
  diff(diff(log(ts_data)), 365),
  max.lag = 31,
  main = "Difference 365 Log Energy Usage"
)

Unhide
qqplot(r.df_energy["kwh_log"].diff().diff(365).dropna(), line="q")

Unhide
qqnorm(diff(diff(log(ts_data)), 365))
qqline(diff(diff(log(ts_data)), 365))

🤔 I don’t think double differencing is the way to go.

Conclusion

  1. log transform the energy data and assess how an AR(1) fits
  2. differencing the log transformed energy data by order d = 1 and asses how an MA(1) fits
  3. annual differencing the log transformed energy data by order D = 1 and assess how an AR(1) fis

Bonus material

Weather data

Unhide
(
  ggplot(r.df_weather_no_missing, aes(x='date', y='tmax')) +
  geom_line(color='darkorange', group=1) +
  labs(
      title="Max Daily Temperature",
      x="Date",
      y="Temp"
  ) +
  theme_minimal(base_size = 15) +
  theme(
      panel_grid = element_blank(),
      plot_title=element_text(size=14)
  ) +
  scale_x_datetime(
    breaks=date_breaks("1 years"),
    labels=date_format("%Y")
  ) +
  ylim(0, np.max(r.df_weather_no_missing[['tmax']]))
) \
.show()

Unhide
ggplot(
  data = df_weather_no_missing,
  aes(x = date, y = tmax)
) +
  geom_line(color = "darkorange") +
  labs(
    title = "Max Daily Temperature",
    x = "Date",
    y = "Temp"
  ) +
  theme_minimal(base_size = 15) +
  theme(
    panel.grid = element_blank(),
    plot.title = element_text(size = 14)
  ) +
  ylim(0, max(df_weather_no_missing[["tmax"]], na.rm = TRUE))

Cross-Correlation Function (CCF)

From Time Series (Robert H. Shumway 2019):

If \(x_t\) and \(y_t\) are independent processes, then under mild conditions, the large sample distribution of \(\hat{\rho}_{xy}(h)\) is normal with mean zero and standard deviation \(\frac{1}{\sqrt{n}}\) **if at least one of the processes is independent white noise.

So, in order to effectively use the CCF to analyze the correlation between max temperature and my household energy usage, at least one of the time series must be white noise.

Previously we noted that the log transformation was helpful but the data still appeared to have a slight upward trend and seasonality. Of the differncing options analyzed above, the annual differencing looked most like white noise and the QQ plot was the best of the differenced approaches. So below we’ll assess the cross-correlation between energy usage and max temperature using the log and annual differencing.

Unhide
plot_ccf(
  x = tmp['kwh_log'],
  y = tmp["tmax"],
  title = "Log and Differenced Energy ~ Max Daily Temperature",
  auto_ylims = True,
  lags = 365*2,
  negative_lags=False
)

Unhide
ccf2(
  x = diff(diff(log(ts_data)), 365),
  y = ts_weather,
  main = "Log and Differenced Energy ~ Max Daily Temperature",
  max.lag = 365 * 2
)

Unhide
plot_ccf(
  x = tmp['kwh_log'],
  y = tmp["tmax"],
  title = "Log and Differenced Energy ~ Max Daily Temperature",
  lags = 31,
  auto_ylims = True,
  negative_lags=False
)

Unhide
ccf2(
  x = diff(diff(log(ts_data)), 365),
  y = ts_weather,
  main = "Log and Differenced Energy ~ Max Daily Temperature",
  max.lag = 31
)

My household energy usage lags max temperature by 7 days but the correlation is weak.

Appendix

Technical gotchas

Quarto seems capable of using a single engine at a time. Thus, for multilingual projects authors must choose a rendering language. For this project I chose the knitr engine, which renders Python code using the reticulate package. This adds a layer of complexity as running Python through reticulate isn’t the same as running Python code directly. Overall, though the process was quite painless.

References

Hunt, Andrew, and David Thomas. 1999. The Pragmatic Programmer: Your Journey to Mastery. 1st ed. Boston, MA: Addison-Wesley Professional. https://pragprog.com/titles/tpp20/the-pragmatic-programmer/.
Robert H. Shumway, David S. Stoffer. 2019. Time Series, a Data Analysis Approach Using r. 6000 Broken Sound Parkway NW, Ste 300, Boca Raton, FL 33487: CRC Press, Taylor & Francis Group.