Best unofficial Apache Server developers community

Hive, Hadoop, and the mechanics behind hive.exec.reducers.max


In context of this other question here

Using the hive.exec.reducers.max directive has truly baffled me.

I had assumed Hive worked on logic roughly like this: the query touches N blocks, so it needs N maps, and from N it picks some sensible number of reducers R, anywhere from R = N / 2 down to R = 1. The Hive report I was working on had 1200+ maps, and left to itself Hive planned about 400 reducers. That would have been fine, except that the cluster had only 70 reducers in total. Even with the fair job scheduler, this caused a backlog that hung up other jobs. After a lot of experiments I found hive.exec.reducers.max and set it to something like 60.

The result was that a Hive job that had taken 248 minutes finished in 155 minutes, with no change in the output. What bothers me is this: why doesn't Hive default to never planning more reducers than the cluster's reducer capacity? And since I can process several terabytes of data with fewer reducers than Hive thinks is correct, is it better to always try to tweak this count?

asked February 17, 2011 12:30 am CST
posted via StackOverflow

2 Answers


You may want to look at this page, which talks about optimizing the number of slots: http://wiki.apache.org/hadoop/LimitingTaskSlotUsage

Here is my opinion on this:

1) Hive ideally tries to optimize the number of reducers based on the amount of data it expects the map tasks to produce, and it expects the underlying cluster to be configured to support that number.

2) As for whether it is a good idea to tweak this count:

  • First, let's analyze why the execution time might have dropped from 248 minutes to 155 minutes:

Case 1: Hive uses 400 reducers. Problem: only 70 reducers can run at any given time.

  • Assuming no JVM reuse, creating JVMs again and again adds a large overhead.

  • I am not sure about this one: planning for 400 reducers may cause a problem like fragmentation. If I know that only 70 reducers can run, my strategy for storing intermediate files depends on that number. With 400 reducers, that strategy falls apart.

Case 2: Hive uses 70 reducers. Setting this number addresses both problems.

I guess it is better to set the maximum to the number of available reducers. But I am no expert at this, so I will let the experts comment.
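To make the two cases concrete, here is a rough Python sketch of the arithmetic. It assumes, as far as I understand Hive's planner, that the reducer count is the input size divided by hive.exec.reducers.bytes.per.reducer, capped by hive.exec.reducers.max. The defaults used here (1,000,000,000 bytes and 999) are what I believe Hive shipped with around that time, and the 400 GB input size is a hypothetical figure chosen only because it reproduces the 400 reducers mentioned in the question.

```python
def estimate_reducers(input_bytes, bytes_per_reducer=1_000_000_000, max_reducers=999):
    """Approximate Hive's reducer estimate (assumed formula, assumed defaults)."""
    # Integer ceiling division: one reducer per bytes_per_reducer of input.
    wanted = -(-input_bytes // bytes_per_reducer)
    # Never fewer than 1, never more than hive.exec.reducers.max.
    return max(1, min(max_reducers, wanted))


def reduce_waves(reducers, reduce_slots):
    """Number of sequential rounds needed when only reduce_slots can run at once."""
    return -(-reducers // reduce_slots)


if __name__ == "__main__":
    input_bytes = 400 * 10**9   # hypothetical input size, not from the question
    cluster_slots = 70          # reducer capacity stated in the question

    uncapped = estimate_reducers(input_bytes)
    capped = estimate_reducers(input_bytes, max_reducers=60)

    print("uncapped:", uncapped, "reducers,", reduce_waves(uncapped, cluster_slots), "waves")
    print("capped:  ", capped, "reducers,", reduce_waves(capped, cluster_slots), "waves")
```

Under these assumptions the uncapped plan needs several waves of reducers on a 70-slot cluster, while the capped plan fits in a single wave. That does not prove the cap is the whole reason for the speedup, but it shows why the two plans behave so differently.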

answered February 18, 2011 4:33 pm CST

In my experience with Hive, setting mapred.job.reuse.jvm.num.tasks to a healthy number (8 in my case) helps with a lot of these ad-hoc queries. It takes around 20-30 seconds to spawn a JVM, so reuse can help quite a bit with mappers and reducers that are short-lived (< 30 seconds).
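As a back-of-the-envelope illustration, the sketch below counts how many JVMs one task slot has to spawn for a given reuse setting. It is a simplified model, not a measurement: the 40 tasks per slot is a hypothetical number, and the 25-second spawn time is just the midpoint of the 20-30 second range above. The reuse_limit parameter mirrors mapred.job.reuse.jvm.num.tasks, where 1 means no reuse and -1 means no limit.

```python
def jvm_spawns(tasks_on_slot, reuse_limit):
    """JVMs one slot must start to run tasks_on_slot tasks of the same job."""
    if tasks_on_slot <= 0:
        return 0
    if reuse_limit == -1:
        # -1 means a single JVM can be reused for any number of tasks.
        return 1
    # Otherwise each JVM runs at most reuse_limit tasks (ceiling division).
    return -(-tasks_on_slot // reuse_limit)


def startup_overhead_seconds(tasks_on_slot, reuse_limit, spawn_seconds):
    """Total time a slot spends starting JVMs under this simplified model."""
    return jvm_spawns(tasks_on_slot, reuse_limit) * spawn_seconds


if __name__ == "__main__":
    tasks = 40   # hypothetical number of short tasks landing on one slot
    spawn = 25   # seconds, midpoint of the 20-30 s range quoted above

    for limit in (1, 8, -1):
        print("reuse limit", limit, "->",
              startup_overhead_seconds(tasks, limit, spawn), "s of JVM startup")
```

With tasks that run for less than 30 seconds, the startup time in the no-reuse case is comparable to the useful work, which is why the setting matters most for short-lived mappers and reducers.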

answered February 26, 2011 9:59 pm CST
