Moving average with changing window




I would like to calculate the moving average of a variable in R with a changing window size. To be more specific: the moving average should be calculated over three years, but the data (a time series) is available at a higher frequency, so the number of observations within each three-year window can vary.



Assume the following dataset:


library(data.table)
set.seed(1)  # reproducible data
dataset <- data.table(ID = c(rep("A", 2208), rep("B", 2208)),
                      x = c(rnorm(2208*2)),
                      time = c(seq(as.Date("1988/03/15"), as.Date("2000/04/16"), "day"),
                               seq(as.Date("1988/03/15"), as.Date("2000/04/16"), "day")))



The three-year moving average of the variable x should be calculated for both of the IDs (persons A and B). Can this be done with zoo and data.table, preferably? But any solution is fine.





Please note that I know how to do this with a fixed window size; the problem here is the varying window size.





What do you mean by "the window size can vary for each three-year window"?
– A. Suliman
Aug 8 at 15:35





I think the OP means that the window will span three years, but that the number of observations within those years is variable.
– Lyngbakr
Aug 8 at 15:37





If dataset size is not an issue, then create a data.table running from the min date to the max date, join your dataset to it, and then you can use a fixed sliding window size.
– Robert Tan
Aug 8 at 15:37





@Lyngbakr that is what I meant, thank you. Unfortunately the dataset size is an issue, but even if it wasn't, the problem of the irregular window size would remain, because a given year does not necessarily have the same number of days as another one (e.g. leap years always differ from normal years)?
– Talik3233
Aug 8 at 15:41






I don't know if it works, but what about dataset %>% group_by(ID) %>% mutate(avg = rollapplyr(x, mean, width = 3*365, fill = NA)), as you already have daily measurements without missing values?
– A. Suliman
Aug 8 at 16:44







2 Answers



If I understand correctly, the OP wants the window to span exactly 3 years. As this may include leap years, the window size can be either 1095 or 1096 days.



This can be solved by aggregating in a non-equi join together with lubridate's rollback date arithmetic.
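For context, %m+% from lubridate adds calendar periods without rolling over into the next month: if the target day does not exist (e.g. 29 February in a non-leap year), it falls back to the last existing day of that month. A minimal sketch of this behaviour (dates chosen to match the leap-year checks below):

library(lubridate)
as.Date("1992-02-29") %m+% years(3)            # "1995-02-28" (29 Feb 1995 does not exist)
as.Date("1992-02-29") %m+% years(3) - days(1)  # "1995-02-27"
as.Date("1989-03-01") %m+% years(3) - days(1)  # "1992-02-29", giving a 1096-day window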




library(data.table)
library(lubridate)
# create 3-year windows for each ID for a later non-equi join
win <- dataset[, CJ(ID = ID, start = time, unique = TRUE)][
  # end of window: exactly 3 years after start, inclusive
  , end := start %m+% years(3) - days(1)][
  # remove windows which end outside the date range
  end <= max(start)]
win


ID start end
1: A 1988-03-15 1991-03-14
2: A 1988-03-16 1991-03-15
3: A 1988-03-17 1991-03-16
4: A 1988-03-18 1991-03-17
5: A 1988-03-19 1991-03-18
---
6638: B 1997-04-13 2000-04-12
6639: B 1997-04-14 2000-04-13
6640: B 1997-04-15 2000-04-14
6641: B 1997-04-16 2000-04-15
6642: B 1997-04-17 2000-04-16


# check window lengths
win[, .N, by = .(days = end - start + 1L)]


days N
1: 1095 days 2166
2: 1096 days 4476


# see what happens in leap years
win[leap_year(start) & month(start) == 2 & day(start) %in% 28:29,
.(start, end, days = end - start + 1L)]


start end days
1: 1992-02-28 1995-02-27 1096 days
2: 1992-02-29 1995-02-27 1095 days
3: 1996-02-28 1999-02-27 1096 days
4: 1996-02-29 1999-02-27 1095 days
5: 1992-02-28 1995-02-27 1096 days
6: 1992-02-29 1995-02-27 1095 days
7: 1996-02-28 1999-02-27 1096 days
8: 1996-02-29 1999-02-27 1095 days


win[leap_year(end) & month(end) == 2 & day(end) %in% 28:29,
.(start, end, days = end - start + 1L)]


start end days
1: 1989-03-01 1992-02-29 1096 days
2: 1993-03-01 1996-02-29 1096 days
3: 1997-03-01 2000-02-29 1096 days
4: 1989-03-01 1992-02-29 1096 days
5: 1993-03-01 1996-02-29 1096 days
6: 1997-03-01 2000-02-29 1096 days


# aggregate in a non-equi-join
dataset[win, on = .(ID, time >= start, time <= end), by = .EACHI, .(avg = mean(x))]


ID time time avg
1: A 1988-03-15 1991-03-14 -0.01184078
2: A 1988-03-16 1991-03-15 -0.01317813
3: A 1988-03-17 1991-03-16 -0.01179571
4: A 1988-03-18 1991-03-17 -0.01006100
5: A 1988-03-19 1991-03-18 -0.01221798
---
6638: B 1997-04-13 2000-04-12 -0.03412214
6639: B 1997-04-14 2000-04-13 -0.03604176
6640: B 1997-04-15 2000-04-14 -0.03556291
6641: B 1997-04-16 2000-04-15 -0.03392185
6642: B 1997-04-17 2000-04-16 -0.03393674
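Note that both boundary columns of the result print as time because they stem from the join conditions. If clearer names are wanted, the result can be renamed afterwards; a minimal sketch, storing the result in a hypothetical variable res:

res <- dataset[win, on = .(ID, time >= start, time <= end), by = .EACHI, .(avg = mean(x))]
# rename all four columns in place
setnames(res, c("ID", "start", "end", "avg"))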





Thanks, this looks perfect to me. Just one question: could you also formulate this with the usual merge syntax from data.table?
– Talik3233
Aug 8 at 19:56







No, the goodies like non-equi join and grouping by each i are only available with the [.data.table syntax. data.table's merge() is more or less only a faster implementation of base R's merge(). There is a section in the FAQ which explains the differences.
– Uwe
Aug 8 at 21:45








I understand, thanks. I was asking because, in my opinion, the syntax of merge is cleaner. It's harder to remember the difference between X[Y] and Y[X], imho. But the solution is great, so I guess I need to get used to this.
– Talik3233
Aug 8 at 21:49







Yes, merge() is simpler but less powerful. X[Y] behaves like a right join, i.e., it takes all rows of Y and tries to find matches in X.
– Uwe
Aug 8 at 21:55
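To make the X[Y] semantics concrete, a small sketch with toy tables (hypothetical data, not from the question):

library(data.table)
X <- data.table(id = 1:2, vx = c("a", "b"), key = "id")
Y <- data.table(id = 2:3, vy = c("B", "C"), key = "id")
X[Y]  # one row per row of Y; id 3 has no match in X, so vx is NA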







There is the package fuzzyjoin that could work as well and has syntax close to merge (and even closer to dplyr joins), but I believe it will be slower.
– Moody_Mudskipper
Aug 9 at 6:25





As @A.Suliman mentions in the comments, your example data has a fixed window width, but let's assume your real data doesn't, as that's what your text says.



The width parameter of rollapply doesn't have to be constant, so we can compute the widths first, make sure we align the window to the left, and run rollapply.



library(zoo)
library(tidyverse)
dataset %>%
  arrange(ID, time) %>%
  group_by(ID) %>%
  mutate(avg = rollapply(x, FUN = mean, align = "left",
                         width = map_dbl(time, ~which.max(time[time < .x + 3*365.25])) - row_number() + 1))
#
# # A tibble: 8,832 x 4
# # Groups: ID [2]
# ID x time avg
# <fctr> <dbl> <date> <dbl>
# 1 A -0.0258 1988-03-15 0.0109
# 2 A -0.0258 1988-03-15 0.0109
# 3 A -0.1562 1988-03-16 0.0115
# 4 A -0.1562 1988-03-16 0.0115
# 5 A 0.8193 1988-03-17 0.0115
# 6 A 0.8193 1988-03-17 0.0112
# 7 A -1.1136 1988-03-18 0.0102
# 8 A -1.1136 1988-03-18 0.0107
# 9 A -0.9336 1988-03-19 0.0105
# 10 A -0.9336 1988-03-19 0.0109





I'm in a hurry so didn't bother translating the syntax to data.table but I think it should be trivial. Solution wasn't checked in depth, treat with caution!
– Moody_Mudskipper
Aug 8 at 17:08





Thanks, interesting. map_dbl seems to be a purrr function, so I guess I would need that package. Do you mind explaining the idea behind ~which.max(time[time < .x + 3*365.25]) - row_number() + 1?
– Talik3233
Aug 8 at 19:52







You can use sapply instead of map_dbl, with standard function notation. The idea behind the code you copied is to find the index of the last date available before the given date + 3 years, subtract from it the index of the given date (+1) to get a window length, and use that window in the width argument.
– Moody_Mudskipper
Aug 8 at 22:04
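For reference, a rough sketch of that sapply variant (same logic as the pipeline in the answer, with only the width computation rewritten using a standard anonymous function; not checked in depth):

library(zoo)
library(dplyr)
dataset %>%
  arrange(ID, time) %>%
  group_by(ID) %>%
  mutate(avg = rollapply(x, FUN = mean, align = "left",
                         width = sapply(time, function(d) which.max(time[time < d + 3*365.25])) - row_number() + 1))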







At first glance it seems my solution and @Uwe's are pretty much the same in principle. His is cleaner and all data.table, though, so you'll be better off focusing on it.
– Moody_Mudskipper
Aug 8 at 22:09





Alright, thanks nonetheless for the explanation!
– Talik3233
Aug 9 at 21:12






