Most of the data we log is in a simple log file format:
- One record per line, separated by newline.
- Each record can have one or more fields. Fields are separated by ^A (\001).
- Each field is a key/value pair separated by ^B (\002). Field order is not specified.
ts=1298598378404/code=403/message=bad referrer: bizo.com/...
(where / represents ^A and = represents ^B)
This is a format we chose pretty early on, way before we ever looked at Hive. It turns out to be a great format:
- Human readable.
- Trivial to parse in any language.
- Dynamic -- easy to add/remove fields from your data.
That is, each row is just a single column, which is a map. At first this seemed a little degenerate to me, but it actually models our data perfectly: there are no guarantees about which fields are available, and it's easy to add/remove fields in the data over time. I should mention that this is really just for our report input; our report output is typically in a fixed format.
create external table api_logs(d map<string,string>)
partitioned by (...)
row format delimited
fields terminated by '\004'
collection items terminated by '\001'
map keys terminated by '\002'
stored as textfile
If you're using Hive 0.6 or greater, which adds view support, it's also easy to get the best of both worlds.
You can even change the type information or transform the data by including a cast or a UDF as part of your view. Creating a view doesn't cause anything to run or create any additional storage; its query conditions are simply merged into subsequent queries against that view.
create view api_errors(ts, code, message) as
select d["ts"], d["code"], d["message"]
from api_logs
where d["code"] >= 400
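Outside of Hive, the view amounts to a projection plus a filter over the parsed records. A Python sketch of that semantics (records as dicts, one map per row; note Hive implicitly coerces d["code"] to a number for the comparison, which the sketch does explicitly):

```python
# Mirror the api_errors view: keep records with code >= 400 and
# project out (ts, code, message). Illustration only, not Hive itself.

def api_errors(records):
    for d in records:
        if int(d.get("code", 0)) >= 400:
            yield (d.get("ts"), d.get("code"), d.get("message"))

rows = [
    {"ts": "1", "code": "200", "message": "ok"},
    {"ts": "2", "code": "403", "message": "bad referrer"},
]
print(list(api_errors(rows)))  # only the code-403 row survives
```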