reduce scripts in hive

larry ogrodnek - 06 Oct 2009

In a previous post, I discussed writing custom map scripts in hive. Now, let's talk about reduce tasks.

The basics

As before, you are not writing an org.apache.hadoop.mapred.Reducer class. Your reducer is just a simple script that reads from stdin (columns separated by \t) and should write rows to stdout (again, columns separated by \t).

Another thing to mention is that you can't run a reduce without first doing a map.

The rows passed to your reduce script will be sorted by key (you specify which column that is), so all rows with the same key will be consecutive. One thing that's kind of a pain with hive reducers is that you need to keep track of when keys change yourself. Unlike a hadoop reducer, where you get a (K key, Iterator&lt;V&gt; values), here you just get row after row of columns.

An example

We'll use a similar example to the map script.

We will attempt to condense a table (kv_input) that looks like:

k1 v1
k2 v1
k4 v1
k2 v3
k3 v1
k1 v2
k4 v2
k2 v2
...

into one (kv_condensed) that looks like:

k1 v1,v2
k2 v1,v2,v3
...

The reduce script

#!/usr/bin/perl

# Condense sorted "key\tvalue" rows into one "key\tv1,v2,..." row per key.
undef $currentKey;
@vals = ();

while (<STDIN>) {
  chomp();
  processRow(split(/\t/));
}

# flush the values collected for the final key
output();

sub output {
  print $currentKey . "\t" . join(",", sort @vals) . "\n";
}

sub processRow {
  my ($k, $v) = @_;

  # first row: remember the key and start collecting values
  if (! defined($currentKey)) {
    $currentKey = $k;
    push(@vals, $v);
    return;
  }

  # key changed: emit the previous key's values, then reset
  if ($currentKey ne $k) {
    output();
    $currentKey = $k;
    @vals = ($v);
    return;
  }

  push(@vals, $v);
}

Please forgive my perl. It's been a long time (I usually write these in java, but thought perl would make for an easier blog example).

As you can see, a lot of the work goes in to just keeping track of when the keys change.
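As an aside (not from the original post), this bookkeeping is exactly what a grouping helper can hide. A minimal Python sketch of the same consecutive-key grouping, using itertools.groupby to recover the (key, values) shape a Hadoop reducer would hand you:

```python
import sys
from itertools import groupby

def rows(stream):
    # each input line is "key\tvalue"
    for line in stream:
        yield line.rstrip("\n").split("\t", 1)

def reduce_rows(stream):
    # groupby collapses consecutive rows that share a key; this only works
    # because the input arrives sorted by key, just as in the Perl script
    for key, group in groupby(rows(stream), key=lambda kv: kv[0]):
        yield key, [v for _, v in group]

if __name__ == "__main__":
    for key, vals in reduce_rows(sys.stdin):
        print(key + "\t" + ",".join(sorted(vals)))
```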

The nice thing about these simple reduce scripts is that they're very easy to test locally, without going through hadoop and hive: just run your script and feed it some tab-separated example text on stdin. If you do this, remember to sort the input by key before piping it in (in production, hadoop/hive does this for you).
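For example (a sketch; the sample file path and the locally saved script name condense.pl are assumptions):

```shell
# three unsorted tab-separated rows
printf 'k2\tv1\nk1\tv1\nk2\tv2\n' > /tmp/kv_sample.txt

# hadoop/hive normally sorts by key before your reducer runs;
# when testing locally, sort on the first field yourself:
sort -t "$(printf '\t')" -k1,1 /tmp/kv_sample.txt

# ...then pipe the sorted rows into the script:
# sort -t "$(printf '\t')" -k1,1 /tmp/kv_sample.txt | ./condense.pl
```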

Reducing from Hive

Okay, now that we have our reduce script working, let's run it from Hive.
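(The example assumes both tables already exist. If you're following along, here's a minimal sketch of matching DDL; the column names k and v are assumptions:)

```
create table kv_input (k string, v string)
row format delimited fields terminated by '\t';

create table kv_condensed (k string, v string)
row format delimited fields terminated by '\t';
```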

First, we need to add our map and reduce scripts ([identity.pl](http://com-bizo-public.s3.amazonaws.com/hive/reduce/identity.pl) and [condense.pl](http://com-bizo-public.s3.amazonaws.com/hive/reduce/condense.pl)):

add file identity.pl;
add file condense.pl;

Now for the real work:

 1 from (
 2    from kv_input
 3     MAP k, v
 4   USING './identity.pl'
 5      as k, v
 6 cluster
 7      by k) map_output
 8 insert overwrite table kv_condensed
 9 reduce k, v
10  using './condense.pl'
11     as k, v
12 ;

This is fairly dense, so I will attempt to give a line by line breakdown:

On line 3 we are specifying the columns to pass to our reduce script from the input table (specified on line 2).

As mentioned above, you must specify a map script in order to reduce. For this example, we're just using a simple identity perl script. On line 5 we name the columns the map script will output.

Line 6 specifies the key column (the cluster by clause continues onto line 7). This determines how the rows will be sorted when passed to your reduce script.

Line 8 specifies the columns to pass into our reducer (from the map output columns on line 5).

Finally, line 10 names the output columns from our reducer.

(Here's my full hive session for this example, and an example input file).

I hope this was helpful. Next time, I'll talk about some java code I put together to simplify the process of writing reduce scripts.
