How to auto-calculate numRepartition when using Spark dataframe write
When I try to write a dataframe to a Hive Parquet partitioned table
df.write.partitionBy("key").mode("append").format("hive").saveAsTable("db.table")
it creates a lot of blocks in HDFS, each of which holds only a small amount of data.
I understand how this happens: each Spark sub-task creates a block and then writes data to it.
I also understand that more blocks can improve Hadoop performance, but past a certain threshold performance starts to degrade.
If I want to set numPartition automatically, does anyone have a good idea?
numPartition = ??? // auto calc based on df size or something
df.repartition(numPartition).write
.partitionBy("key")
.format("hive")
.saveAsTable("db.table")
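A rough sketch of one way such an auto-calculation could look, assuming Spark 2.3+ (where df.queryExecution.optimizedPlan.stats.sizeInBytes is available) and a target output-file size of about 128 MB; the helper name autoNumPartitions and the target size are illustrative, and Catalyst's size estimate can be inaccurate for some sources, so treat this as a starting point rather than a definitive answer:

import org.apache.spark.sql.DataFrame

// Hypothetical helper: pick a partition count so each output file is roughly targetBytes.
// Relies on Catalyst's size estimate, which may be off depending on the data source.
def autoNumPartitions(df: DataFrame, targetBytes: Long = 128L * 1024 * 1024): Int = {
  val estimatedBytes = df.queryExecution.optimizedPlan.stats.sizeInBytes
  val n = ((estimatedBytes + targetBytes - 1) / targetBytes).toInt // ceiling division
  math.max(n, 1)
}

val numPartition = autoNumPartitions(df)
df.repartition(numPartition)
  .write
  .partitionBy("key")
  .mode("append")
  .format("hive")
  .saveAsTable("db.table")

One caveat: this sizes the total output, not each Hive partition, so a skewed "key" column can still produce small files inside individual partitions.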
It's almost as if you've asked How to master Apache-Spark? Choosing the right level of parallelism is the crux of leveraging Spark's full capabilities, among other things. Here's a good start; it boils down to the amount of data you are processing: number of columns, type of columns, number of rows, etc. It will take time and effort (hit and trial) to arrive at a metric for numPartitions. I start with the number of rows and the size of the data (GBs) to predict it, and then fine-tune from there.
– y2k-shubham
Aug 13 at 7:26
Your blog is amazing; I implemented some of your best practices in my df transformations. Your methodology looks great, but in my case I have tons of offline data pipelines. Is it a good choice if I ignore the "repartition" part and optimize it afterwards?
– Eric Yiwei Liu
Aug 13 at 15:32
@Eric Yiwei Liu, the said blog is from Umberto Griffo. It's completely acceptable if you want to take things step-by-step, viz. skipping repartition now and re-visiting it later. In fact, IMHO, when you're dealing with complex frameworks like Spark, it's better to take this route: quickly hack up a preliminary solution and then gradually work up from there to improve it. Recall that premature optimization is the root of all evil.
– y2k-shubham
Aug 14 at 4:27
@y2k-shubham Absolutely, the compress warning on CDH has been working well so far; I think it's always a good feature for future Spark, letting us focus more on development.
– Eric Yiwei Liu
Aug 15 at 6:10
1 Answer
By default, Spark creates 200 partitions for shuffle operations, so 200 files/blocks (each quite small if there is little data) will be written to HDFS.
Configure the number of partitions to be created after a shuffle based on your data, using the configuration below:
spark.conf.set("spark.sql.shuffle.partitions", <number of partitions>)
For example, spark.conf.set("spark.sql.shuffle.partitions", "5") makes Spark create 5 partitions, and 5 files will be written to HDFS.
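Note that spark.sql.shuffle.partitions only takes effect when the write involves a shuffle; a plain df.write does not trigger one. A small sketch of how the setting could be combined with the write from the question (repartitioning by the Hive partition column here is an illustrative assumption, not part of the original answer):

import org.apache.spark.sql.functions.col

// The setting applies to shuffles, so introduce one explicitly before the write.
spark.conf.set("spark.sql.shuffle.partitions", "5")

df.repartition(col("key")) // hash-shuffles into 5 partitions, as configured above
  .write
  .partitionBy("key")
  .mode("append")
  .format("hive")
  .saveAsTable("db.table")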