EMR/Hive: recovering a large number of partitions

larry ogrodnek - 26 Jan 2011

If you try to run "alter table ... recover partitions" on a table with a large number of partitions, you may run into this error:

FAILED: Error in metadata: org.jets3t.service.S3ServiceException: Failed to sanitize XML document destined for handler class org.jets3t.service.impl.rest.XmlResponsesSaxParser$ListBucketHandler null 'null' -- ResponseCode: -1, ResponseStatus: null, RequestId: null, HostId: null
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

There's some discussion in the AWS forums. The underlying cause is that Hive runs out of memory while building the partition list from the S3 bucket listing.

A workaround is to increase HADOOP_HEAPSIZE. This can be done by appending to hadoop-user-env.sh with an EMR bootstrap action. On an m1.large instance, 2 GB seems to do the trick for us.
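To see why appending a line to hadoop-user-env.sh is enough: Hadoop's startup scripts source that file before launching their JVMs, so any variable set there ends up in the daemon environment. A minimal sketch of that mechanism, using a temp file in place of the real /home/hadoop/conf/hadoop-user-env.sh:

```shell
#!/bin/bash
# Sketch: how a setting appended to hadoop-user-env.sh becomes visible.
# Uses a temporary file instead of the real EMR conf path.
conf="$(mktemp)"

# The bootstrap action appends the override...
echo "HADOOP_HEAPSIZE=2048" >> "$conf"

# ...and Hadoop's startup scripts source the file before launching JVMs,
# so the value shows up as an environment variable (here, in MB).
source "$conf"
echo "heap size: ${HADOOP_HEAPSIZE} MB"

rm -f "$conf"
```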

Upload a script like the following somewhere in s3:

#!/bin/bash
# Set HADOOP_HEAPSIZE (in MB); defaults to 2048 if no argument is given.

if [ "$#" -lt 1 ]; then
  SIZE="2048"
else
  SIZE="$1"
fi

# hadoop-user-env.sh is sourced by Hadoop's startup scripts on each node.
echo "HADOOP_HEAPSIZE=${SIZE}" >> /home/hadoop/conf/hadoop-user-env.sh
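Before uploading, the default-vs-argument logic can be sanity-checked on its own; this standalone sketch (a hypothetical check-heap.sh, not part of the bootstrap action) prints the line instead of appending it to the conf file:

```shell
#!/bin/bash
# Sketch: same default-or-argument logic as the bootstrap script,
# printing the resulting line instead of writing to the conf file.
if [ "$#" -lt 1 ]; then
  SIZE="2048"   # default heap size in MB
else
  SIZE="$1"     # caller-supplied heap size
fi
echo "HADOOP_HEAPSIZE=${SIZE}"
```

Run with no argument it prints HADOOP_HEAPSIZE=2048; run as `bash check-heap.sh 4096` it prints HADOOP_HEAPSIZE=4096.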

You can now run this bootstrap action as part of your job:

elastic-mapreduce --create --alive \
  --name "large partitions..." --hive-interactive \
  --num-instances 1 --instance-type m1.large \
  --hadoop-version 0.20 \
  --bootstrap-action s3://<bucket/path>/set-hadoop-heap.sh

You should now be able to recover your partitions.
