At Bizo, we run a number of periodically scheduled Hive jobs produce a high level summary as just a few (often, just one) row of data. In the past, we’ve simply used the same delivery mechanism as with larger reports; the output is emailed as a CSV file to the appropriate distribution list. This was less than ideal for a number of reasons:
- Managing the distribution lists is difficult. We either needed to create a new list for each type of report, giving us a lot of lists to manage, or just send reports to a generic distribution list, resulting in a lot of unnecessary emails to people who weren’t necessarily interested in the report.
- Handling the historical context is manual; the report needs to pull in past results to include in the output or recipients of the output need to find older emails to see trends appear.
- Report delivery required an additional step in the job workflow outside of the Hive script.
With the GData storage handler, we now just create a Google Spreadsheet, add appropriate column headers, and do something like this in our script:
add jar gdata-storagehandler.jar ;
create external table gdata_output(
day string, cnt int, source_class string, source_method string, thrown_class string
stored by 'com.bizo.hive.gdata.GDataStorageHandler'
with serdeproperties (
"gdata.user" = "firstname.lastname@example.org",
"gdata.consumer.key" = "bizo.com",
"gdata.consumer.secret" = "...",
"gdata.spreadsheet.name" = "Daily Exception Summary",
"gdata.worksheet.name" = "My Application",
"gdata.columns.mapping" = "day,count,class,method,thrown"
This appends whatever data is written to the table to the specified spreadsheet.
The source code is available here. If you’re running your jobs on Amazon’s Elastic MapReduce, you can access the storage handler by adding the following line to your Hive script:
add jar s3://com-bizo-public/hive/storagehandler/gdata-storagehandler-0.1.jar ;
Note that the library only supports 2-legged OAuth access to Google Apps for Domains, which needs to be enabled in your Google Apps control panel.