custom map scripts and hive

larry ogrodnek - 14 Jul 2009

First, I have to say that after using Hive for the past couple of weeks and actually writing some real reporting tasks with it, it would be really hard to go back. If you are writing straight hadoop jobs for any kind of report, please give hive a shot. You'll thank me.

Sometimes, you need to perform data transformation in a more complex way than SQL will allow (even with custom UDFs). Specifically, if you want to return a different number of columns, or a different number of rows for a given input row, then you need to perform what hive calls a transform. This is basically a custom streaming map task.

The basics

You are not writing an org.apache.hadoop.mapred.Mapper class! This is just a simple script that reads rows from stdin (columns separated by \t) and should write rows to stdout (again, columns separated by \t). It's probably worth mentioning this again but you shouldn't be thinking Key Value, you need to think about columns.
You can write your script in any language you want, but it needs to be available on all machines in the cluster. Any easy way to do this is to take advantage of the hadoop distributed cache support, and just use add file /path/to/script within hive. The script will then be distributed and can be run as just ./script (assuming it is executable), or perl script.pl if it's perl, etc.

An example

This is a simplified example, but recently I had a case where one of my columns contained a bunch of key/value pairs separated by commas:

k1=v1,k2=v2,k3=v3,...
k1=v1,k2=v2,k3=v3,...
k1=v1,k2=v2,k3=v3,...

I wanted to transform these records into a 2 column table of k/v:

k1  v1
k2  v2
k3  v3
k1  v1
k2  v2
...

I wrote a simple perl script to handle the map, created the 2 column output table, then ran the following:

-- add script to distributed cache
add file /tmp/[split_kv.pl](http://com-bizo-public.s3.amazonaws.com/hive/mapper/split_kv.pl)

-- run transform
insert overwrite table test_kv_split
select
  transform (d.kvs)
    using './split_kv.pl'
    as (k, v)
from
  (select all_kvs as kvs from kv_input) d
;

As you can see, you can specify both the input and output columns as part of your transform statement.

And... that's all there is to it. Next time... a reducer?

comments powered by Disqus

Previous Post Next Post