Spark HDFS Direct Read vs Hive External table read

We have a couple of HDFS directories in which data is stored in delimited format. One directory is created per ingestion date, and each directory is added as a partition to a Hive external table.

Directory structure:

/data/table1/INGEST_DATE=20180101

/data/table1/INGEST_DATE=20180102

/data/table1/INGEST_DATE=20180103 etc.

Now we want to process this data in a Spark job. From the program I can either read these HDFS directories directly by giving the exact directory path (Option 1), or read from Hive into a DataFrame and process it (Option 2).
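
To make the two options concrete, here is a rough sketch in PySpark. The pipe delimiter, the table name `mydb.table1`, and the helper names are assumptions for illustration only, not details of the actual job.

```python
def partition_path(base_dir: str, ingest_date: str) -> str:
    """Build the HDFS directory for one ingestion date."""
    return f"{base_dir.rstrip('/')}/INGEST_DATE={ingest_date}"


def read_direct(spark, ingest_date, base_dir="/data/table1", sep="|"):
    """Option 1: read one partition directory straight from HDFS.

    `sep` is an assumed delimiter. Without an explicit schema, every column
    is read as a string and named _c0, _c1, ...
    """
    return spark.read.option("sep", sep).csv(partition_path(base_dir, ingest_date))


def read_via_hive(spark, ingest_date, table="mydb.table1"):
    """Option 2: read through the Hive external table.

    `table` is a hypothetical name. The filter assumes INGEST_DATE is a
    string partition column; column names and types come from the metastore.
    """
    return spark.table(table).where(f"INGEST_DATE = '{ingest_date}'")
```

Option 2 needs a SparkSession created with `enableHiveSupport()` so that the metastore is visible to the job.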

I would like to know whether there is any significant difference between Option 1 and Option 2. Please let me know if you need any other details.
Thanks in advance.

Did the answer help in your understanding?
– thebluephantom
Aug 26 at 10:17

1 Answer

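If you want to select a subset of the columns, that is easier via spark.sql, because the Hive table already carries the column names and types. With a direct read of delimited files you have to supply the schema yourself before you can select columns by name. In your use case I don't think there will be a significant difference.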

With Spark SQL you also get partition pruning automatically: when the query filters on INGEST_DATE, only the matching partition directories are read.
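
As a minimal sketch of what that could look like in PySpark: the table name, the column names `id` and `name`, and both helper functions below are hypothetical, and the filter assumes INGEST_DATE is a string partition column.

```python
def pruned_query(table: str, ingest_date: str, columns=("*",)) -> str:
    """Build a query that selects some columns and filters on the partition column."""
    cols = ", ".join(columns)
    return f"SELECT {cols} FROM {table} WHERE INGEST_DATE = '{ingest_date}'"


def read_pruned(spark, table, ingest_date, columns=("*",)):
    """Run the query and print its physical plan for inspection."""
    df = spark.sql(pruned_query(table, ingest_date, columns))
    # explain() prints the plan; the INGEST_DATE predicate should show up
    # on the table scan if the partition filter was pushed down.
    df.explain()
    return df
```

For example, `read_pruned(spark, "mydb.table1", "20180101", ("id", "name"))` would run `SELECT id, name FROM mydb.table1 WHERE INGEST_DATE = '20180101'`. The exact plan text varies by Spark version, so treat the `explain()` output as a check rather than a guarantee.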
