CSV and Hive

larry ogrodnek - 12 Nov 2010

CSV

Anyone who's ever dealt with CSV files knows how much of a pain the format actually is to parse. It's not as simple as splitting on commas -- the fields might have commas embedded in them, so, okay you put quotes around the field... but what if the field had quotes in it? Then you double up the quotes... "okay, ""great""" -- that was a single CSV field.

We normally use the excellent opencsv (apache2 licensed) library to deal with CSV files.

Hive

We love Hive. Almost all of our reporting is written as Hive scripts. How do you deal with CSV files with Hive? If you know for sure your fields don't have any commas in them, you can get away with the delimited format. There's the RegexSerDe, but as mentioned the format is non-trivial, and you need to change the regex string depending on how many columns you are expecting.

CSVSerde

Enter the CSVSerde. It's a Hive SerDe that uses the opencsv parser to serialize and deserialize tables properly in the CSV format.

Using it is pretty simple:

add jar path/to/csv-serde.jar;

create table my_table(a string, b string, ...)
  row format serde 'com.bizo.hive.serde.csv.CSVSerde'
  stored as textfile
;

This is my first time writing a Hive SerDe. There were a couple of road bumps, but overall I was surprised with how easy it was. I mostly just followed along with RegexSerDe.

I'm sure there are a lot of ways it could be improved, so I'd appreciate any feedback or comments on how to make it better.

Source.

Binary (jar packaged with opencsv).

comments powered by Disqus