Spark HDFS Direct Read vs Hive External table read


We have a couple of HDFS directories in which data is stored in delimited format. These directories are created one per ingestion date and added as partitions to a Hive external table (a sketch of the partition DDL follows the listing below).



Directory structure:

/data/table1/INGEST_DATE=20180101
/data/table1/INGEST_DATE=20180102
/data/table1/INGEST_DATE=20180103
etc.
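
For context, here is a hedged sketch of how each dated directory might be attached as a partition. The table name `table1` is an assumption taken from the path, and the same DDL could equally be run from the Hive CLI:

```scala
import org.apache.spark.sql.SparkSession

// A SparkSession with Hive support; the later snippets reuse this `spark`.
val spark = SparkSession.builder()
  .appName("IngestDateRead")
  .enableHiveSupport()
  .getOrCreate()

// Attach one dated directory as a partition of the external table.
// The table name `table1` is assumed from the directory layout above.
spark.sql("""
  ALTER TABLE table1 ADD IF NOT EXISTS
  PARTITION (INGEST_DATE = '20180103')
  LOCATION '/data/table1/INGEST_DATE=20180103'
""")
```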



Now we want to process this data in a Spark job. From the program I can either read these HDFS directories directly by giving the exact directory path (Option 1), or read from Hive into a DataFrame and process it (Option 2), as sketched below.
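
A minimal sketch of the two options, reusing the `spark` session above (the pipe delimiter is an assumption; adjust it to the actual file format):

```scala
// Option 1: read the HDFS directory directly by its exact path.
val direct = spark.read
  .option("delimiter", "|") // assumed delimiter
  .csv("/data/table1/INGEST_DATE=20180101")

// Option 2: read through the Hive external table into a DataFrame.
val viaHive = spark.sql("SELECT * FROM table1 WHERE INGEST_DATE = '20180101'")
```

One caveat: reading a single partition directory directly (Option 1) does not put the INGEST_DATE column in the schema, since its value lives only in the path; reading the table root /data/table1 with partition discovery, or going through Hive, does.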



I would like to know if there is any significant difference between following Option 1 or Option 2. Please let me know if you need any other details.
Thanks in advance.





Did the answer help in your understanding? – thebluephantom, Aug 26 at 10:17




1 Answer



If you want to select a subset of the columns, that is only possible via spark.sql. In your use case I don't think there will be a significant difference.



With Spark SQL you get partition pruning automatically.
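
For illustration, a small sketch of what that pruning looks like (again assuming the table is named `table1`):

```scala
// Because INGEST_DATE is a partition column, this filter is applied at
// planning time: only /data/table1/INGEST_DATE=20180103 is scanned.
val oneDay = spark.table("table1").where("INGEST_DATE = '20180103'")

// The physical plan lists the predicate under PartitionFilters.
oneDay.explain()
```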





