How surrogate keys are handles in hive
Clash Royale CLAN TAG#URR8PPP
How surrogate keys are handles in hive
I know that hive cannot create surrogate keys or is rather difficult. I want to understand how companies have implemented dimensional modeling in their warehouse.
One way I can think of is leaving the dimension details as is in fact. Then move the distinct of dimension to a different table. But then how are scd1 and scd2 handled. I have checked talks by Kimball on cloudera and I still don't understand how this works.
1 Answer
1
There are two ways of handling this problem in Hive.
The first does not directly answer your question, and that is to use natural keys instead of surrogates. While surrogates are more convenient and performant, since you're using Hive I'm guessing that performance isn't one of your major criteria, so the cost of using natural keys will just be in the extra lines of code you have to write to cater for compound keys.
The second way is to use Hive's windowing functions to calculate the surrogate. I don't have a Hive environment handy to test this query, but the surrogate would look something like:
(select max(surrogate_key_column) from dimension_table)
+ row_number() over (order by 1)
I've not used that approach in Hive, but I've used it extensively in other data warehouse environments. I don't understand why multiple tables would prevent natural keys being used ... surely, for example, CustomerA in Table1 would be the same as CustomerA in Table2? If not, then the natural key should also include the concept represented by the separate table. If you can share some more details of your tables it would help to understand the problem better.
– Ron Dunn
Aug 13 at 6:00
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
That can be done. But I would like to know if you people are implementing this in their company. Are you using the second approach in your work? Also using NK approach would not work if we are creating fact from multiple tables. NKs are going to overlap.
– Ravi R
Aug 12 at 10:51