I just learned about the collect_set() function in Hive, and I started a job on a 3-node development cluster. I only have about 10 GB to process. The job, however, is taking forever. I think there is either a bug in the implementation of collect_set(), a bug in my code, or collect_set() is simply very resource intensive. Here's My SQL for Hive (no pun intended):
There are 4 MR passes. The first took about 30 seconds. The second map took about 1 minute, and most of the second reduce took about 2 minutes. Over the last two hours, though, it has only crept from 97.71% to 97.73%. Is this right? I think there must be some issue. I took a look at the log, and I can't tell whether it's normal. [Sample of log]
I'm pretty new at this, and trying to work with collect_set() and Hive Array is driving me off the deep end. Thanks in advance :)
posted via StackOverflow
The first thing I would try is getting rid of the sub-select and joining directly to site_event, then moving the event_id filter into the outer WHERE clause and changing it to an IN(). So something like:
Additionally, I don't know the size of each table, but in general in Hive you want to keep your largest table (usually the fact table) on the right-hand side of a join to reduce memory usage. The reason is that Hive buffers the rows of the left-hand tables in memory and streams the right-most table through the reducers to accomplish the join.
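As a rough illustration of the two suggestions above (the original query is not shown here, so everything except site_event and event_id is a hypothetical name, including the users table, the user_id column, and the event_id values), a flattened join with the large table last might look like this:

```sql
-- Sketch only. Assumed schema (hypothetical):
--   users(user_id, ...)                 -- small dimension table
--   site_event(user_id, event_id, ...)  -- large fact table
SELECT u.user_id,
       collect_set(e.event_id) AS event_ids
FROM users u
JOIN site_event e                       -- largest table last, so it is streamed
  ON (u.user_id = e.user_id)
WHERE e.event_id IN (101, 102, 103)     -- filter moved to the outer WHERE as an IN()
GROUP BY u.user_id;

-- If reordering the tables is inconvenient, the STREAMTABLE hint names
-- the table to stream regardless of its position in the join:
SELECT /*+ STREAMTABLE(e) */
       u.user_id,
       collect_set(e.event_id) AS event_ids
FROM site_event e
JOIN users u
  ON (u.user_id = e.user_id)
WHERE e.event_id IN (101, 102, 103)
GROUP BY u.user_id;
```

Whether the hint is honored can depend on the Hive version and optimizer settings, so treat the second form as something to try rather than a guarantee.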
Major fail. My solution is below. There was no issue with COLLECT_SET after all; it was just trying to COLLECT all of the items, of which there were effectively infinitely many. Why? Because I joined on something that wasn't even part of the set. The second join used to have the same ON condition; now it correctly says
I'd guess what's happening is that it's producing a [...]. Check the performance with [...]. I haven't used COLLECT_SET or run any tests, but from your post, that's what I'd first suspect.
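One way to test a suspicion like this before blaming collect_set() is to measure how many rows the join itself produces. The sketch below assumes a hypothetical visits table and a hypothetical session_id join key; only site_event comes from the thread.

```sql
-- Sketch only. Assumed schema (hypothetical):
--   visits(session_id, ...)
--   site_event(session_id, event_id, ...)

-- Step 1: count the rows the join emits, without any aggregation.
-- A result far larger than either input points at the join condition.
SELECT COUNT(*) AS joined_rows
FROM visits v
JOIN site_event e
  ON (v.session_id = e.session_id);

-- Step 2: find the join keys with the highest fan-out on the large side.
SELECT session_id,
       COUNT(*) AS n
FROM site_event
GROUP BY session_id
ORDER BY n DESC
LIMIT 20;
```

If the first count is orders of magnitude larger than the input tables, the reducer stuck near 97% is most likely grinding through a handful of exploded keys, and the ON clause is the place to look.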