Moving average with changing window

Moving average with changing window

I would like to calculate the moving average of a variable in R with a changing window size. To be more specific: The moving average should be calculated over three years, but the data (time-series) is available in higher frequency and the window size can vary for each three-year window.

Assume the following dataset:

library(data.table) set.seed(1) # reproduceable data dataset <- data.table(ID=c(rep("A",2208),rep("B",2208)), x = c(rnorm(2208*2)), time=c(seq(as.Date("1988/03/15"), as.Date("2000/04/16"), "day"),seq(as.Date("1988/03/15"), as.Date("2000/04/16"), "day")))

The three-year moving average of the variable x should be calculated for both of the IDs (person A and B). Can this be done with zoo and datatable preferably? But any solution is fine.

zoo

datatable

Please note I know how to do this with a fixed window size, the problem here is the varying window size.

What you mean by the window size can vary for each three-year window?
– A. Suliman
Aug 8 at 15:35

I think the OP means that the window will span three years, but that the number of observations within those years is variable.
– Lyngbakr
Aug 8 at 15:37

If dataset size is not an issue, then create a data.table starting with min date, to max date and join your dataset, then you can use a fixed sliding window size.
– Robert Tan
Aug 8 at 15:37

@Lyngbakr that is what I meant, thank you. Unfortunately the dataset size is an issue, but even if it wasn't, the problem with irregular window size would exist, because in a given year there are not necessarily the same amount of days as in another one (i.e. in leapyears there is always a difference from normal years)?
– Talik3233
Aug 8 at 15:41

I don't know if it works, but what about dataset %>% group_by(ID) %>% mutate(avg=rollapplyr(x,mean,width=3*365,fill=NA)), As you already have daily measuerments without missing values.
– A. Suliman
Aug 8 at 16:44

dataset %>% group_by(ID) %>% mutate(avg=rollapplyr(x,mean,width=3*365,fill=NA))

2 Answers
2

If I understand correctly the OP wants to span exactly 3 years. As this may include leap years, the window size can be either 1095 days or 1096 days.

This can be solved by aggregating in a non-equi join together with lubridate's rollback date arithmetic.

lubridate

library(data.table) library(lubridate) # create 3 years windows for each ID for later non-equi join win <- dataset[, CJ(ID = ID, start = time, unique = TRUE)][ # make sure to pick , end := start %m+% years(3) - days(1)][ # remove windows which end out of date range end <= max(start)] win

ID start end 1: A 1988-03-15 1991-03-14 2: A 1988-03-16 1991-03-15 3: A 1988-03-17 1991-03-16 4: A 1988-03-18 1991-03-17 5: A 1988-03-19 1991-03-18 --- 6638: B 1997-04-13 2000-04-12 6639: B 1997-04-14 2000-04-13 6640: B 1997-04-15 2000-04-14 6641: B 1997-04-16 2000-04-15 6642: B 1997-04-17 2000-04-16

# check window lengths win[, .N, by = .(days = end - start + 1L)]

days N 1: 1095 days 2166 2: 1096 days 4476

# see what happens in leap years win[leap_year(start) & month(start) == 2 & day(start) %in% 28:29, .(start, end, days = end - start + 1L)]

start end days 1: 1992-02-28 1995-02-27 1096 days 2: 1992-02-29 1995-02-27 1095 days 3: 1996-02-28 1999-02-27 1096 days 4: 1996-02-29 1999-02-27 1095 days 5: 1992-02-28 1995-02-27 1096 days 6: 1992-02-29 1995-02-27 1095 days 7: 1996-02-28 1999-02-27 1096 days 8: 1996-02-29 1999-02-27 1095 days

win[leap_year(end) & month(end) == 2 & day(end) %in% 28:29, .(start, end, days = end - start + 1L)]

start end days 1: 1989-03-01 1992-02-29 1096 days 2: 1993-03-01 1996-02-29 1096 days 3: 1997-03-01 2000-02-29 1096 days 4: 1989-03-01 1992-02-29 1096 days 5: 1993-03-01 1996-02-29 1096 days 6: 1997-03-01 2000-02-29 1096 days

# aggregate in a non-equi-join dataset[win, on = .(ID, time >= start, time <= end), by = .EACHI, .(avg = mean(x))]

ID time time avg 1: A 1988-03-15 1991-03-14 -0.01184078 2: A 1988-03-16 1991-03-15 -0.01317813 3: A 1988-03-17 1991-03-16 -0.01179571 4: A 1988-03-18 1991-03-17 -0.01006100 5: A 1988-03-19 1991-03-18 -0.01221798 --- 6638: B 1997-04-13 2000-04-12 -0.03412214 6639: B 1997-04-14 2000-04-13 -0.03604176 6640: B 1997-04-15 2000-04-14 -0.03556291 6641: B 1997-04-16 2000-04-15 -0.03392185 6642: B 1997-04-17 2000-04-16 -0.03393674

thanks this looks perfect to me. Just one question: Could you also formulate this with the usual merge syntax from datatable?
– Talik3233
Aug 8 at 19:56

merge

datatable

No, the goodies like non-equi join and grouping by each i are only available with the [.data.table syntax. data.table's merge() is more or less only a faster implementation of base R's merge(). There is a section in the FAQ which explains the differences.
– Uwe
Aug 8 at 21:45

[.data.table

data.table

merge()

I understand, thanks. I was asking because in my opinion the syntax of merge is cleaner. It's harder to remember the difference between X[Y] and Y[X] imho. But the solution is great, so I guess I need to get used to this.
– Talik3233
Aug 8 at 21:49

merge

Yes, merge() is simpler but less powerful. X[Y] behaves like a right join, i.e., take all rows of Y and try to find matches in X.
– Uwe
Aug 8 at 21:55

merge()

X[Y]

Y

X

There is the package fuzzyjoin that could work as well and has syntax close to merge (and even closer to dplyr joins), but I believe it will be slower.
– Moody_Mudskipper
Aug 9 at 6:25

fuzzyjoin

merge

dplyr

As @A.Suliman mentions in comments your example data has a fixed window width, but let's assume your real data hasn't as that's what your text says.

The parameter width from rollapply doesn't have to be constant, so we can compute the width first, make sure we align the window to the left, and run rollaply.

width

rollapply

rollaply

library(zoo) library(tidyverse) dataset %>% arrange(ID,time) %>% group_by(ID) %>% mutate(avg = rollapply(x, FUN = mean, align = "left", width = map_dbl(time, ~which.max(time[time < .x + 3*365.25])) - row_number()+1)) # # # A tibble: 8,832 x 4 # # Groups: ID [2] # ID x time avg # <fctr> <dbl> <date> <dbl> # 1 A -0.0258 1988-03-15 0.0109 # 2 A -0.0258 1988-03-15 0.0109 # 3 A -0.1562 1988-03-16 0.0115 # 4 A -0.1562 1988-03-16 0.0115 # 5 A 0.8193 1988-03-17 0.0115 # 6 A 0.8193 1988-03-17 0.0112 # 7 A -1.1136 1988-03-18 0.0102 # 8 A -1.1136 1988-03-18 0.0107 # 9 A -0.9336 1988-03-19 0.0105 # 10 A -0.9336 1988-03-19 0.0109

I'm in a hurry so didn't bother translating the syntax to data.table but I think it should be trivial. Solution wasn't checked in depth, treat with caution!
– Moody_Mudskipper
Aug 8 at 17:08

thanks, interesting. map_dbl seems to be a purrr function, so I guess I would need that package. Do you mind explaining what is the idea behind ~which.max(time[time < .x + 3*365.25])) - row_number()+1) ?
– Talik3233
Aug 8 at 19:52

~which.max(time[time < .x + 3*365.25])) - row_number()+1)

You can use sapply instead of map_dbl, with standard function notation. The idea behind the code you copied is to find the index of last date available before given date + 3 years , subtract from it the index from given date (+1) to get a time window, and use this window in the width argument
– Moody_Mudskipper
Aug 8 at 22:04

sapply

map_dbl

width

At first glance it seems my solution and @Uwe's are pretty much the same in their principles. His is cleaner and all data.table though so you'll be better off focusing on it.
– Moody_Mudskipper
Aug 8 at 22:09

alright, thanks nontheless for the explanation!
– Talik3233
Aug 9 at 21:12

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

7joqXvS,Jm,mVBw7L75G,IgO3H,hjvu4xFbpw8tiypbHG rYZG,1DaVmT1YSqrdMzlKjZ,TPw6mtLK

搜尋此網誌

Sfyjdyy