Performance issue while using Window


This is the data I have:



I have time-series data, and I want to split it into intervals such that whenever a 1 appears in the detector column, it marks the end of one interval and the start of the next. I can do this with groupBy, but I want an alternative because of the performance issue with groupBy. At the same time, I also want to detect an interval boundary whenever the time difference between two consecutive rows is greater than or equal to 15.



For simplicity, take the example below:


time | detector
5 | 0
10 | 0
15 | 0
20 | 0
25 | 1
35 | 0
40 | 0
56 | 0
57 | 0
55 | 0
60 | 1
65 | 0
70 | 0
75 | 0
80 | 1
85 | 0



The output I want is:


interval
[5,25]
[25,60]
[40,56]
[60,80]
[80,85]
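For reference, the detector-based split boils down to a running sum over the detector column followed by a group on that sum (the same trick the Window-based code below uses). This is a minimal, Spark-free sketch of that logic, assuming the rows are already sorted by time:

```scala
// Minimal, Spark-free sketch of the detector-based split, assuming the rows
// are already sorted by time. The running sum over `detector` plays the role
// of sum("detector").over(wAll): every detector == 1 row starts a new group.
object IntervalSketch {
  /** (time, detector) rows -> [start, end] intervals split at detector == 1. */
  def detectorIntervals(rows: Seq[(Int, Int)]): Seq[(Int, Int)] = {
    // Tag each row with the running total of `detector` up to and including it.
    val tagged = rows.scanLeft((0, 0, 0)) { case ((_, _, total), (t, d)) =>
      (t, d, total + d)
    }.tail
    // Group consecutive rows by that total; each interval ends where the next
    // group begins (the lead_time of the group's last row in the Spark code).
    val groups = tagged.groupBy(_._3).toSeq.sortBy(_._1).map(_._2)
    val inner = groups.zip(groups.drop(1)).map { case (g, next) =>
      (g.head._1, next.head._1)
    }
    inner ++ groups.lastOption.map(g => (g.head._1, g.last._1))
  }
}
```

On the sample data sorted by time, this produces [5,25], [25,60], [60,80], [80,85], matching the detector intervals above.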



Update 1: here is my current attempt using Window:


val wAll = Window.partitionBy(col("imei")).orderBy(col("time").asc)

val test = df
  .withColumn("lead_time", lead("time", 1, null).over(wAll))
  .withColumn("runningTotal", sum("detector").over(wAll))
  .groupBy("runningTotal")
  .agg(struct(min("time"), max("lead_time")).as("interval"))



This computes the intervals where consecutive data points are at least 15 minutes apart:


val unreachable_df = df
  .withColumn("lag_time", lag("time", 1, null).over(wAll))
  .withColumn("diff_time", abs((col("time") - col("lag_time")) / 60D))
  .withColumn("unreachable", when(col("diff_time") >= 15.0, 0).otherwise(1))
  .drop("diff_time")
  .drop("lag_time")
  .withColumn("runningTotal", sum("unreachable").over(wAll))
  .groupBy("runningTotal")
  .agg(struct(min("time"), max("time")).as("interval"))
  .withColumn("diff_interval", abs(unix_timestamp(col("interval.col1")) - unix_timestamp(col("interval.col2"))))
  .filter(col("diff_interval") > 0)
  .drop("diff_interval")
  .withColumn("type", lit("Unreachable"))
  .drop("runningTotal")
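The "unreachable" pass amounts to flagging pairs of consecutive rows whose gap is at least 15. A Spark-free sketch of that logic, assuming sorted rows with times already in minutes (the Spark version divides by 60 because its timestamps are in seconds):

```scala
// Spark-free sketch of the "unreachable" pass, assuming sorted (time, detector)
// rows with time already in minutes. A gap of at least `minGap` between two
// consecutive rows becomes one [t1, t2] interval, just like the groups that
// survive the diff_interval > 0 filter in the Spark code.
object GapSketch {
  def gapIntervals(rows: Seq[(Int, Int)], minGap: Int): Seq[(Int, Int)] =
    rows.zip(rows.drop(1)).collect {
      case ((t1, _), (t2, _)) if t2 - t1 >= minGap => (t1, t2)
    }
}
```

Note that on the sample data sorted by time this flags the 40 → 55 gap; the [40,56] in the expected output reflects the unsorted listing above.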



Then I merged the two DataFrames to get the result above:


val merged_df = test.union(unreachable_df).sort(col("interval.col1"))





Can you please supply the code with the performance issue so we can help you work that out?
– eliasah
Aug 28 at 6:52





Why don't you post the whole schema of your input df, and the test data?
– tauitdnmd
Aug 28 at 7:12





Your window function doesn't have a partition column, so all of your data will go into one partition. I'd start by fixing that.
– eliasah
Aug 28 at 7:12





stackoverflow.com/a/41316277/3415409
– eliasah
Aug 28 at 7:14





@eliasah I have posted just the sample data, whereas the original data is GPS data consisting of different IDs, so I have partitioned on the basis of the id column.
– experiment
Aug 28 at 7:15








