Performance issue while using Window
Clash Royale CLAN TAG#URR8PPP
Performance issue while using Window
I have a time series data ,I want to get the interval of data in such a way that if 1 is detected in detector column then it will be the end of one interval and start of other interval .I can do this with groupby but I want an alternative method to do so because of the performance issue while using groupby and also simultaneously detecting the interval in such a way if the difference between time for two consecutive rows is greater than or equal to 15.
For simplicity we can take an example like below
time | detector
5 | 0
10 | 0
15 | 0
20 | 0
25 | 1
35 | 0
40 | 0
56 | 0
57 | 0
55 | 0
60 | 1
65 | 0
70 | 0
75 | 0
80 | 1
85 | 0
Output I want is
interval
[5,25]
[25,60]
[40,56]
[60,80]
[80,85]
update 1:
val wAll = Window.partitionBy(col("imei")).orderBy(col("time").asc)
val test= df.withColumn("lead_time", lead("time", 1, null).over(wAll)).withColumn("runningTotal", sum("detector").over(wAll))
.groupBy("runningTotal").agg(struct(min("time"), max("lead_time")).as("interval"))
This is for calculation of data points greater than equal to 15min
val unreachable_df=df
.withColumn("lag_time",lag("time", 1, null).over(wAll))
.withColumn("diff_time",abs((col("time") - col("lag_time"))/60D))
.withColumn("unreachable",when(col("diff_time")>=15.0,0).otherwise(1))
.drop(col("diff_time"))
.drop(col("lag_time"))
.withColumn("runningTotal", sum("unreachable").over(wAll))
.groupBy("runningTotal")
.agg(struct(min("time"), max("time")).as("interval"))
.withColumn("diff_interval",abs((unix_timestamp(col("interval.col1"))-unix_timestamp(col("interval.col2")))))
.filter(col("diff_interval")>0) .drop("diff_interval")
.withColumn("type",lit("Unreachable")).drop("runningTotal")
Then I have merged the two dataframe to get the above result
val merged_df=test.union(unreachable_df).sort(col("interval.col1"))
why dont you post the whole schema of your input df? & the test data.
– tauitdnmd
Aug 28 at 7:12
Your window function doesn’t have a partition column this all of your data will go into one partition. I’d start fixing that.
– eliasah
Aug 28 at 7:12
stackoverflow.com/a/41316277/3415409
– eliasah
Aug 28 at 7:14
@eliasah I have posted just the sample data whereas the original data is a GPS data consist of different ids so I have partitioned on the basis of column id
– experiment
Aug 28 at 7:15
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
Can you please supply the code with the performance issue so we can help you work that out ?
– eliasah
Aug 28 at 6:52