spark aggregation for array column

I have a dataframe with an array column.


val json = """[
"id": 1, "value": [11, 12, 18],
"id": 2, "value": [23, 21, 29]
]"""

val df = spark.read.json(Seq(json).toDS)

scala> df.show
+---+------------+
| id| value|
+---+------------+
| 1|[11, 12, 18]|
| 2|[23, 21, 29]|
+---+------------+



Now I need to apply different aggregate functions to the value column.
I can call explode and groupBy, for example:




df.select($"id", explode($"value").as("value")).groupBy($"id").agg(max("value"), avg("value")).show

+---+----------+------------------+
| id|max(value)| avg(value)|
+---+----------+------------------+
| 1| 18|13.666666666666666|
| 2| 29|24.333333333333332|
+---+----------+------------------+



What bothers me here is that I explode my DataFrame into a bigger one and then reduce it back to the original size by calling groupBy.





Is there a better (i.e. more efficient) way to call aggregate functions on an array column? I could probably implement a UDF, but I don't want to implement all the aggregation UDFs myself.
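One possible direction, assuming Spark 2.4 or later: built-in functions can operate on the array in place, with no explode/groupBy round trip. `array_max` handles the maximum directly, and an average can be built from the higher-order `aggregate` SQL function combined with `size` (the `"max(value)"`/`"avg(value)"` column aliases below are just chosen to match the earlier output):

```scala
import org.apache.spark.sql.functions.{array_max, expr}

// Compute per-row aggregates over the array column without exploding.
// aggregate(value, 0D, (acc, x) -> acc + x) sums the array elements;
// dividing by size(value) yields the average.
df.select(
  $"id",
  array_max($"value").as("max(value)"),
  expr("aggregate(value, 0D, (acc, x) -> acc + x) / size(value)").as("avg(value)")
).show
```

This keeps one row per id throughout, so Spark never materializes the exploded intermediate.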









