Moving average with changing window
Clash Royale CLAN TAG#URR8PPP
Moving average with changing window
I would like to calculate the moving average of a variable in R with a changing window size. To be more specific: The moving average should be calculated over three years, but the data (time-series) is available in higher frequency and the window size can vary for each three-year window.
Assume the following dataset:
library(data.table)
set.seed(1) # reproduceable data
dataset <- data.table(ID=c(rep("A",2208),rep("B",2208)),
x = c(rnorm(2208*2)), time=c(seq(as.Date("1988/03/15"),
as.Date("2000/04/16"), "day"),seq(as.Date("1988/03/15"),
as.Date("2000/04/16"), "day")))
The three-year moving average of the variable x should be calculated for both of the IDs (person A and B). Can this be done with zoo
and datatable
preferably? But any solution is fine.
zoo
datatable
Please note I know how to do this with a fixed window size, the problem here is the varying window size.
I think the OP means that the window will span three years, but that the number of observations within those years is variable.
– Lyngbakr
Aug 8 at 15:37
If dataset size is not an issue, then create a data.table starting with min date, to max date and join your dataset, then you can use a fixed sliding window size.
– Robert Tan
Aug 8 at 15:37
@Lyngbakr that is what I meant, thank you. Unfortunately the dataset size is an issue, but even if it wasn't, the problem with irregular window size would exist, because in a given year there are not necessarily the same amount of days as in another one (i.e. in leapyears there is always a difference from normal years)?
– Talik3233
Aug 8 at 15:41
I don't know if it works, but what about
dataset %>% group_by(ID) %>% mutate(avg=rollapplyr(x,mean,width=3*365,fill=NA))
, As you already have daily measuerments without missing values.– A. Suliman
Aug 8 at 16:44
dataset %>% group_by(ID) %>% mutate(avg=rollapplyr(x,mean,width=3*365,fill=NA))
2 Answers
2
If I understand correctly the OP wants to span exactly 3 years. As this may include leap years, the window size can be either 1095 days or 1096 days.
This can be solved by aggregating in a non-equi join together with lubridate
's rollback date arithmetic.
lubridate
library(data.table)
library(lubridate)
# create 3 years windows for each ID for later non-equi join
win <- dataset[, CJ(ID = ID, start = time, unique = TRUE)][
# make sure to pick
, end := start %m+% years(3) - days(1)][
# remove windows which end out of date range
end <= max(start)]
win
ID start end
1: A 1988-03-15 1991-03-14
2: A 1988-03-16 1991-03-15
3: A 1988-03-17 1991-03-16
4: A 1988-03-18 1991-03-17
5: A 1988-03-19 1991-03-18
---
6638: B 1997-04-13 2000-04-12
6639: B 1997-04-14 2000-04-13
6640: B 1997-04-15 2000-04-14
6641: B 1997-04-16 2000-04-15
6642: B 1997-04-17 2000-04-16
# check window lengths
win[, .N, by = .(days = end - start + 1L)]
days N
1: 1095 days 2166
2: 1096 days 4476
# see what happens in leap years
win[leap_year(start) & month(start) == 2 & day(start) %in% 28:29,
.(start, end, days = end - start + 1L)]
start end days
1: 1992-02-28 1995-02-27 1096 days
2: 1992-02-29 1995-02-27 1095 days
3: 1996-02-28 1999-02-27 1096 days
4: 1996-02-29 1999-02-27 1095 days
5: 1992-02-28 1995-02-27 1096 days
6: 1992-02-29 1995-02-27 1095 days
7: 1996-02-28 1999-02-27 1096 days
8: 1996-02-29 1999-02-27 1095 days
win[leap_year(end) & month(end) == 2 & day(end) %in% 28:29,
.(start, end, days = end - start + 1L)]
start end days
1: 1989-03-01 1992-02-29 1096 days
2: 1993-03-01 1996-02-29 1096 days
3: 1997-03-01 2000-02-29 1096 days
4: 1989-03-01 1992-02-29 1096 days
5: 1993-03-01 1996-02-29 1096 days
6: 1997-03-01 2000-02-29 1096 days
# aggregate in a non-equi-join
dataset[win, on = .(ID, time >= start, time <= end), by = .EACHI, .(avg = mean(x))]
ID time time avg
1: A 1988-03-15 1991-03-14 -0.01184078
2: A 1988-03-16 1991-03-15 -0.01317813
3: A 1988-03-17 1991-03-16 -0.01179571
4: A 1988-03-18 1991-03-17 -0.01006100
5: A 1988-03-19 1991-03-18 -0.01221798
---
6638: B 1997-04-13 2000-04-12 -0.03412214
6639: B 1997-04-14 2000-04-13 -0.03604176
6640: B 1997-04-15 2000-04-14 -0.03556291
6641: B 1997-04-16 2000-04-15 -0.03392185
6642: B 1997-04-17 2000-04-16 -0.03393674
thanks this looks perfect to me. Just one question: Could you also formulate this with the usual
merge
syntax from datatable
?– Talik3233
Aug 8 at 19:56
merge
datatable
No, the goodies like non-equi join and grouping by each i are only available with the
[.data.table
syntax. data.table
's merge()
is more or less only a faster implementation of base R's merge()
. There is a section in the FAQ which explains the differences.– Uwe
Aug 8 at 21:45
[.data.table
data.table
merge()
merge()
I understand, thanks. I was asking because in my opinion the syntax of
merge
is cleaner. It's harder to remember the difference between X[Y] and Y[X] imho. But the solution is great, so I guess I need to get used to this.– Talik3233
Aug 8 at 21:49
merge
Yes,
merge()
is simpler but less powerful. X[Y]
behaves like a right join, i.e., take all rows of Y
and try to find matches in X
.– Uwe
Aug 8 at 21:55
merge()
X[Y]
Y
X
There is the package
fuzzyjoin
that could work as well and has syntax close to merge
(and even closer to dplyr
joins), but I believe it will be slower.– Moody_Mudskipper
Aug 9 at 6:25
fuzzyjoin
merge
dplyr
As @A.Suliman mentions in comments your example data has a fixed window width, but let's assume your real data hasn't as that's what your text says.
The parameter width
from rollapply
doesn't have to be constant, so we can compute the width first, make sure we align the window to the left, and run rollaply
.
width
rollapply
rollaply
library(zoo)
library(tidyverse)
dataset %>%
arrange(ID,time) %>%
group_by(ID) %>%
mutate(avg = rollapply(x, FUN = mean, align = "left",
width = map_dbl(time, ~which.max(time[time < .x + 3*365.25])) - row_number()+1))
#
# # A tibble: 8,832 x 4
# # Groups: ID [2]
# ID x time avg
# <fctr> <dbl> <date> <dbl>
# 1 A -0.0258 1988-03-15 0.0109
# 2 A -0.0258 1988-03-15 0.0109
# 3 A -0.1562 1988-03-16 0.0115
# 4 A -0.1562 1988-03-16 0.0115
# 5 A 0.8193 1988-03-17 0.0115
# 6 A 0.8193 1988-03-17 0.0112
# 7 A -1.1136 1988-03-18 0.0102
# 8 A -1.1136 1988-03-18 0.0107
# 9 A -0.9336 1988-03-19 0.0105
# 10 A -0.9336 1988-03-19 0.0109
I'm in a hurry so didn't bother translating the syntax to data.table but I think it should be trivial. Solution wasn't checked in depth, treat with caution!
– Moody_Mudskipper
Aug 8 at 17:08
thanks, interesting. map_dbl seems to be a purrr function, so I guess I would need that package. Do you mind explaining what is the idea behind
~which.max(time[time < .x + 3*365.25])) - row_number()+1)
?– Talik3233
Aug 8 at 19:52
~which.max(time[time < .x + 3*365.25])) - row_number()+1)
You can use
sapply
instead of map_dbl
, with standard function notation. The idea behind the code you copied is to find the index of last date available before given date + 3 years , subtract from it the index from given date (+1) to get a time window, and use this window in the width
argument– Moody_Mudskipper
Aug 8 at 22:04
sapply
map_dbl
width
At first glance it seems my solution and @Uwe's are pretty much the same in their principles. His is cleaner and all data.table though so you'll be better off focusing on it.
– Moody_Mudskipper
Aug 8 at 22:09
alright, thanks nontheless for the explanation!
– Talik3233
Aug 9 at 21:12
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
What you mean by the window size can vary for each three-year window?
– A. Suliman
Aug 8 at 15:35