<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-5261056907132640554</id><updated>2012-02-07T22:19:05.272-08:00</updated><category term='ruby'/><category term='reflection'/><category term='dynect'/><category term='GWT'/><category term='s3'/><category term='ec2 eclipse'/><category term='ec2'/><category term='macosx'/><category term='video standup'/><category term='jersey'/><category term='emr'/><category term='hadoop'/><category term='iam'/><category term='job'/><category term='java kill process'/><category term='cloudwatch'/><category term='firefox plugin'/><category term='hive'/><category term='performance'/><category term='aws'/><category term='Google I/O'/><category term='gslb'/><category term='ebs'/><category term='crowdflower'/><category term='thrift'/><category term='growl'/><category term='xml'/><category term='berkeleydb'/><category term='hackday'/><category term='visualization'/><category term='ant'/><category term='scala'/><category term='java'/><category term='engineering'/><category term='programming'/><category term='culture'/><category term='cloudviz'/><category term='bash'/><category term='OO'/><category term='mongodb'/><category term='ichat'/><category term='ops'/><category term='appengine'/><category term='simple db'/><category term='Google Visualizations'/><category term='salesforce dart'/><category term='unit testing'/><category term='boto'/><category term='capistrano'/><category term='sdbtool'/><category term='udtf'/><title type='text'>bizo developer blog</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' 
href='http://dev.bizo.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default?start-index=101&amp;max-results=100'/><author><name>larry ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>109</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-2126581180088986857</id><published>2012-01-30T19:38:00.001-08:00</published><updated>2012-01-30T19:50:57.437-08:00</updated><title type='text'>work at Bizo (looking for some good engineers)</title><content type='html'>&lt;p&gt;We’re a small, disciplined team that gets a lot done. Our platform processes billions of page views monthly and 100s of terabytes of data so we have lots of fun problems to tackle.  We believe in &lt;a href="http://dev.bizo.com/2011/03/on-building-kick-ass-engineering-team.html"&gt;teamwork and communication&lt;/a&gt;: comments, design reviews, code reviews for every change, weekly tech talks.  We believe in giving developers ownership over projects.  We believe Engineering is more than coding.  
We have fun and keep the beer fridge well stocked.&lt;/p&gt;&lt;p&gt;We have customers, are well funded, and were recently named the fourth fastest growing private company in the San Francisco Bay Area.&lt;/p&gt;&lt;p&gt;We are looking for motivated problem solvers with an entrepreneurial / hacker spirit.&lt;/p&gt;&lt;p&gt;If you're a reader of this blog, you already know our technology stack.  Some highlights: Scala, Java, JavaScript, Ruby, AWS (pretty much every service), Hadoop/Hive, GWT, MongoDB, Solr, etc.&lt;/p&gt;&lt;p&gt;If you're interested, please &lt;a href="http://careers.stackoverflow.com/jobs/16330/are-you-a-mind-bending-engineer-bizo"&gt;apply on stackoverflow&lt;/a&gt;.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-2126581180088986857?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/2126581180088986857/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=2126581180088986857' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/2126581180088986857'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/2126581180088986857'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2012/01/work-at-bizo-looking-for-some-good.html' title='work at Bizo (looking for some good engineers)'/><author><name>larry ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-3443860205490767668</id><published>2012-01-18T17:39:00.000-08:00</published><updated>2012-01-18T17:42:03.348-08:00</updated><title type='text'>Using GenericUDFs to return multiple values in Apache Hive</title><content type='html'>A basic user defined function (UDF) in Hive is very easy to write: you simply subclass &lt;a href="http://hive.apache.org/docs/current/api/org/apache/hadoop/hive/ql/exec/UDF.html" target="_blank"&gt;org.apache.hadoop.hive.ql.exec.UDF&lt;/a&gt; and implement an evaluate method. &amp;nbsp;We've &lt;a href="http://dev.bizo.com/2009/06/custom-udfs-and-hive.html" target="_blank"&gt;previously written&lt;/a&gt; about this strategy, and it works well for most simple cases.&lt;br /&gt;&lt;br /&gt;The first case where this breaks down is when you want to return multiple values from your UDF. &amp;nbsp;For me, this often arises when we have serialized data stored in a single Hive field and want to extract multiple pieces of information from it.&lt;br /&gt;&lt;br /&gt;For example, suppose we have a simple Person object (leaving out all of the error checking code):&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;case class Person(val firstName: String, val lastName: String)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;object Person {&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; def serialize(p: Person): String = {&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; p.firstName + "|" + p.lastName&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, 
monospace;"&gt;&amp;nbsp; }&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; def deserialize(s: String): Person = {&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; val parts = s.split('|')&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; Person(parts(0), parts(1))&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; }&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;}&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;We want to convert a data table containing these serialized objects into one containing firstName and lastName columns.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;create table input(serializedPerson string) ;&lt;/span&gt;&lt;br /&gt;&lt;pre class="code-java" style="background-color: white; line-height: 1.3; overflow-x: auto; overflow-y: auto; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px; text-align: left;"&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;load data local inpath ... 
;&lt;/span&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;create table output(firstName string, lastName string) ;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Times, 'Times New Roman', serif;"&gt;So, what should our UDF and query look like?&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Times, 'Times New Roman', serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Times, 'Times New Roman', serif;"&gt;Using the previous strategy, we could create two separate UDFs:&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Times, 'Times New Roman', serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;insert overwrite table output&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;select firstName(serializedPerson), lastName(serializedPerson)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;from input ;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Times, 'Times New Roman', serif;"&gt;Unfortunately, the two invocations will have to separately deserialize their inputs, which could be expensive in less trivial examples. &amp;nbsp;It also requires writing two separate implementation classes whose only difference is which field to pull out of your model object.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Times, 'Times New Roman', serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Times, 'Times New Roman', serif;"&gt;An alternative is to use a GenericUDF and return a struct instead of a simple string. 
&amp;nbsp;This requires using &lt;a href="http://hive.apache.org/docs/r0.7.1/api/org/apache/hadoop/hive/serde2/objectinspector/package-summary.html" target="_blank"&gt;object inspectors&lt;/a&gt; to specify the input and output types, just like in a UDTF:&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Times, 'Times New Roman', serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;class&amp;nbsp;&lt;/span&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;DeserializePerson&lt;/span&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;extends GenericUDF {&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; private var inputInspector: PrimitiveObjectInspector = _&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="p1"&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp;&lt;span class="s1"&gt;def&lt;/span&gt; initialize(inputs: Array[ObjectInspector]): StructObjectInspector = {&lt;/span&gt;&lt;/div&gt;&lt;div class="p1"&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;span class="s1"&gt;this&lt;/span&gt;.inputInspector = inputs(0).asInstanceOf[PrimitiveObjectInspector]&lt;/span&gt;&lt;/div&gt;&lt;div class="p2"&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="p2"&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; val stringOI =&lt;/span&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; PrimitiveObjectInspectorFactory.getPrimitiveJavaObjectInspector(STRING)&lt;/span&gt;&lt;/div&gt;&lt;div class="p2"&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="p2"&gt;&lt;span style="font-family: 
'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; val outputFieldNames = Seq("firstName", "lastName")&lt;/span&gt;&lt;/div&gt;&lt;div class="p2"&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; val outputInspectors = Seq(stringOI, stringOI)&lt;/span&gt;&lt;/div&gt;&lt;div class="p2"&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; ObjectInspectorFactory.getStandardStructObjectInspector(&lt;/span&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;outputFieldNames&lt;/span&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;,&lt;/span&gt;&lt;/div&gt;&lt;div class="p2"&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;outputInspectors&lt;/span&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;)&lt;/span&gt;&lt;/div&gt;&lt;div class="p1"&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; }&lt;/span&gt;&lt;/div&gt;&lt;div class="p1"&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="p1"&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; def getDisplayString(children: Array[String]): String = {&lt;/span&gt;&lt;/div&gt;&lt;div class="p1"&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; "deserialize(" + children.mkString(",") + ")"&lt;/span&gt;&lt;/div&gt;&lt;div class="p1"&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; }&lt;/span&gt;&lt;/div&gt;&lt;div class="p1"&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="p1"&gt;&lt;span style="font-family: 'Courier New', Courier, 
monospace;"&gt;&amp;nbsp; def evaluate(args: Array[DeferredObject]): Object = {&lt;/span&gt;&lt;/div&gt;&lt;div class="p1"&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; val input = inputInspector.getPrimitiveJavaObject(args(0).get)&lt;/span&gt;&lt;/div&gt;&lt;div class="p1"&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; val person = Person.deserialize(input.asInstanceOf[String])&lt;/span&gt;&lt;/div&gt;&lt;div class="p1"&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; Array(person.firstName, person.lastName)&lt;/span&gt;&lt;/div&gt;&lt;div class="p1"&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; }&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;}&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Times, 'Times New Roman', serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Times, 'Times New Roman', serif;"&gt;Here, we're specifying that we expect a single primitive object inspector as an input (error handling code omitted) and returning a struct containing two fields, both of which are strings. 
&amp;nbsp;We can now use the following query:&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Times, 'Times New Roman', serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;create temporary function deserializePerson as 'com.bizo.udf.DeserializePerson' ;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;insert overwrite table output&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;select person.firstName, person.lastName&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;from (&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; select deserializePerson(serializedPerson) as person&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; from input&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;) parsed ;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Times, 'Times New Roman', serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Times, 'Times New Roman', serif;"&gt;This query deserializes the person only once but gives you access to both of the values returned by the UDF.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Times, 'Times New Roman', serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Times, 'Times New Roman', serif;"&gt;Note that this method does not allow you to return multiple rows -- for that, you still need to use a &lt;a href="http://dev.bizo.com/2010/07/extending-hive-with-custom-udtfs.html" target="_blank"&gt;UDTF&lt;/a&gt;.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Times, 'Times New Roman', serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;div 
class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-3443860205490767668?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/3443860205490767668/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=3443860205490767668' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/3443860205490767668'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/3443860205490767668'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2012/01/using-genericudfs-to-return-multiple.html' title='Using GenericUDFs to return multiple values in Apache Hive'/><author><name>Darren Lee</name><uri>https://profiles.google.com/107568415606207989360</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh3.googleusercontent.com/-hYxqr8C5FwI/AAAAAAAAAAI/AAAAAAAAAB4/g_fQsD_uYcI/s512-c/photo.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-3723252975151234241</id><published>2012-01-13T15:14:00.000-08:00</published><updated>2012-01-16T10:29:37.073-08:00</updated><title type='text'>Clustering of sparse data using python with scikit-learn</title><content type='html'>&lt;div&gt;&lt;span id="internal-source-marker_0.6221589369233698"&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;Coming from a Matlab background, I found sparse matrices to be easy to use and well integrated into the language. 
However, when transitioning to python’s scientific computing ecosystem, I had a harder time using sparse matrices. This post is intended to help Matlab refugees and those interested in using sparse matrices in python (in particular, for clustering).&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;Requirements:&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;scikit-learn (0.10+)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;numpy (refer to scikit-learn version requirements)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;scipy (refer to scikit-learn version requirements)&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;b id="internal-source-marker_0.6221589369233698"&gt;&lt;span style="font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;Sparse Matrix Types:&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;There are seven types of sparse matrices implemented under scipy: 
&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;&lt;span&gt;&lt;span style="font-size: 15px; white-space: pre-wrap;"&gt;bsr_matrix -- block sparse row matrix&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span&gt;&lt;span style="font-size: 15px; white-space: pre-wrap;"&gt;coo_matrix -- sparse matrix in coordinate format&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span&gt;&lt;span style="font-size: 15px; white-space: pre-wrap;"&gt;csc_matrix -- compressed sparse column matrix&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span&gt;&lt;span style="font-size: 15px; white-space: pre-wrap;"&gt;csr_matrix -- compressed sparse row matrix&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span&gt;&lt;span style="font-size: 15px; white-space: pre-wrap;"&gt;dia_matrix -- sparse matrix with diagonal storage&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span&gt;&lt;span style="font-size: 15px; white-space: pre-wrap;"&gt;dok_matrix -- dictionary of keys based sparse matrix&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span&gt;&lt;span style="font-size: 15px; white-space: pre-wrap;"&gt;lil_matrix -- row-based linked list sparse matrix&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ol&gt;For more info see: &lt;b id="internal-source-marker_0.6221589369233698"&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;(&lt;/span&gt;&lt;a href="http://docs.scipy.org/doc/scipy/reference/sparse.html"&gt;&lt;span style="font-size: 15px; font-family: Arial; color: rgb(0, 0, 153); font-weight: normal; vertical-align: baseline; white-space: pre-wrap; "&gt;http://docs.scipy.org/doc/scipy/reference/sparse.html&lt;/span&gt;&lt;/a&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;):&lt;/span&gt;&lt;/b&gt; &lt;/div&gt;&lt;div&gt;&lt;br 
/&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span&gt;When to use which matrix:&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;The following are scenarios in which you would choose one sparse matrix type over another:&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;b style="font-family: Arial; font-size: 15px; white-space: pre-wrap; "&gt;&lt;i&gt;Fast Arithmetic Operation&lt;/i&gt;:   &lt;/b&gt;&lt;span style="text-decoration: none; vertical-align: baseline; "&gt;csc_matrix&lt;/span&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;, csr_matrix&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;&lt;span style="font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; font-size: 15px; "&gt;&lt;i&gt;Fast Column Slicing (e.g., A[:, 1:2]):   &lt;/i&gt;&lt;/span&gt;&lt;/b&gt;&lt;span style="font-size: 15px; white-space: pre-wrap; "&gt;csc_matrix&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;i&gt;Fast Row Slicing (e.g., A[1:2, :])   &lt;/i&gt;&lt;/span&gt;&lt;/b&gt;&lt;span style="font-size: 15px; white-space: pre-wrap; "&gt;csr_matrix&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;&lt;i&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;Fast Matrix vector products:   &lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span style="font-family: Arial; font-size: 15px; white-space: pre-wrap; "&gt;csr_matrix, bsr_matrix, csc_matrix&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;&lt;i&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;Fast Changing of sparsity (e.g., 
adding entries to matrix):   &lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span style="font-family: Arial; font-size: 15px; white-space: pre-wrap; "&gt;lil_matrix, dok_matrix&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;&lt;i&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;Fast conversion to other sparse formats:   &lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span style="font-size: 15px; white-space: pre-wrap; "&gt;coo_matrix&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;b style="font-size: 15px; white-space: pre-wrap; "&gt;&lt;i&gt;Constructing Large Sparse Matrices:  &lt;/i&gt;&lt;/b&gt;&lt;span style="font-size: 15px; white-space: pre-wrap; "&gt;coo_matrix&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span id="internal-source-marker_0.6221589369233698"&gt;&lt;span style="font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;b&gt;Clustering with scikit-learn:&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;b id="internal-source-marker_0.4135049181059003"&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;With the release of scikit-learn 0.10, one of the useful new features is the support for sparse matrices with the k-means algorithm. 
The following is how you would use sparse matrices with k-means:&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;span style="font-size: 15px; white-space: pre-wrap;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;span style="white-space: pre-wrap; "&gt;&lt;b&gt;Full Matrix to Sparse Matrix&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;span style="font-size: 15px; white-space: pre-wrap;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;span style="font-size: 15px; white-space: pre-wrap;"&gt;&lt;b&gt;Example-1&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;span style="font-size: 15px; white-space: pre-wrap;"&gt;-----------------------------------------------------------------------------------------------&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;blockquote style="font-size: 15px; white-space: pre-wrap; "&gt;&lt;/blockquote&gt;&lt;span style="font-size: 15px; white-space: pre-wrap;"&gt;from numpy.random import random &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;span style="font-size: 15px; white-space: pre-wrap;"&gt;from scipy.sparse import * &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;span style="font-size: 15px; white-space: pre-wrap;"&gt;from sklearn.cluster import KMeans  &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;span style="font-size: 15px; white-space: pre-wrap;"&gt;&lt;span&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;span style="font-size: 15px; white-space: pre-wrap;"&gt;&lt;span&gt;&lt;b&gt;# create a 30x1000 dense random matrix. 
&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;span style="font-size: 15px; white-space: pre-wrap;"&gt;D = random((30,1000)) &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-family: Arial; font-size: 15px; white-space: pre-wrap; "&gt;&lt;b&gt;# keep entries with value &amp;lt; 0.10 (10% of entries in matrix will be non-zero)&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-family: Arial; font-size: 15px; white-space: pre-wrap; "&gt;&lt;b&gt;# X is a "full" matrix that is intrinsically sparse.&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;span style="font-size: 15px; white-space: pre-wrap;"&gt;X = D*(D&amp;lt;0.10)&lt;span&gt;&lt;b&gt; # note: element wise mult&lt;/b&gt;&lt;/span&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;span style="font-size: 15px; white-space: pre-wrap;"&gt;&lt;span&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;span style="font-size: 15px; white-space: pre-wrap;"&gt;&lt;span&gt;&lt;b&gt;# convert X into a sparse matrix (type coo_matrix) &lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;span style="font-size: 15px; white-space: pre-wrap;"&gt;&lt;span&gt;&lt;b&gt;# note: we can initialize any type of sparse matrix. 
&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;span style="font-size: 15px; white-space: pre-wrap;"&gt;&lt;span&gt;&lt;b&gt;#           There is no particular motivation behind using &lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;span style="font-size: 15px; white-space: pre-wrap;"&gt;&lt;span&gt;&lt;b&gt;#            coo_matrix for this example.&lt;/b&gt;&lt;/span&gt; &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-size: 15px; white-space: pre-wrap; font-family: Arial; "&gt;S = coo_matrix(X)   &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-size: 15px; white-space: pre-wrap; font-family: Arial; "&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-size: 15px; white-space: pre-wrap; font-family: Arial; "&gt;labeler = KMeans(k=3) &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-size: 15px; white-space: pre-wrap; font-family: Arial; "&gt;&lt;b&gt;# convert coo to csr format &lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-size: 15px; white-space: pre-wrap; font-family: Arial; "&gt;&lt;b&gt;# note: Kmeans currently only works with CSR type sparse matrix&lt;/b&gt;&lt;/span&gt;&lt;span style="font-size: 15px; white-space: pre-wrap; font-family: Arial; "&gt; &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-size: 15px; white-space: pre-wrap; font-family: Arial; "&gt;labeler.fit(S.tocsr())  &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-size: 15px; white-space: pre-wrap; font-family: Arial; "&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b style="font-size: 15px; white-space: pre-wrap; font-family: Arial; "&gt;&lt;span&gt;# print cluster assignments for each row&lt;/span&gt;&lt;/b&gt;&lt;span style="font-size: 15px; white-space: pre-wrap; font-family: Arial; "&gt; &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-size: 15px; white-space: pre-wrap; font-family: Arial; "&gt;for (row, 
label) in enumerate(labeler.labels_):   &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-size: 15px; white-space: pre-wrap; font-family: Arial; "&gt;  print "row %d has label %d"%(row, label)  &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-family: Arial; font-size: 15px; white-space: pre-wrap; "&gt;-----------------------------------------------------------------------------------------------&lt;/span&gt; &lt;/div&gt;&lt;div&gt;&lt;span style="font-family: Arial; font-size: 15px; white-space: pre-wrap; "&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-family: Arial; font-size: 15px; white-space: pre-wrap; "&gt;One issue with Example-1 is that we construct the sparse matrix from a full matrix. Often a full (although intrinsically sparse) matrix will not fit in memory. For example, if X were a 100000x1000000000 full matrix, it would hold 10^14 entries, far more than fits in memory on a single machine. One solution is to extract only the non-zero entries of X and build the sparse matrix from them directly. &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-family: Arial; font-size: 15px; white-space: pre-wrap; "&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-family: Arial; white-space: pre-wrap; "&gt;&lt;b&gt;Sparse Matrix Construction&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-family: Arial; font-size: 15px; white-space: pre-wrap; "&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-family: Arial; font-size: 15px; white-space: pre-wrap; "&gt;In Example-2, we will assume that X's data is stored in a file on disk. 
In particular, we will assume that X is stored in a csv file and that we are able to extract out the non-zero data efficiently.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-family: Arial; font-size: 15px; white-space: pre-wrap; "&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-family: Arial; font-size: 15px; white-space: pre-wrap; "&gt;&lt;b&gt;Example-2&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-family: Arial; font-size: 15px; white-space: pre-wrap; "&gt;-------------------------------------------------&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;b id="internal-source-marker_0.03044087067246437"&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;import csv&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;from scipy.sparse import *&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;from sklearn.cluster import KMeans&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;def extract_nonzero(fname):&lt;/span&gt;&lt;br /&gt;&lt;span&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;  """&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;  extracts nonzero entries from a csv 
file&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;  input: fname (str) -- path to csv file&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;  output: generator&amp;lt;(int, int, float)&amp;gt; -- generator&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;          producing 3-tuple containing (row-index, column-index, data)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;  """&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;  for (rindex,row) in enumerate(csv.reader(open(fname))):&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;    for (cindex, data) in enumerate(row):&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;      if data!="0":&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;        yield (rindex, cindex, float(data))&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: 
baseline; white-space: pre-wrap; "&gt;def get_dimensions(fname):&lt;/span&gt;&lt;br /&gt;&lt;span&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;  """&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;  determines the dimension of a csv file&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;  input: fname (str) -- path to csv file&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;  output: (nrows, ncols) -- tuple containing row x col data&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;  """&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;  rowgen = (row for row in csv.reader(open(fname)))&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt; &lt;span&gt; # compute col size&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;  colsize = len(rowgen.next())&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt; &lt;span&gt; # compute row size&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; 
vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;span&gt; &lt;/span&gt; rowsize = 1 + sum(1 for row in rowgen)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;  return (rowsize, colsize)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;b id="internal-source-marker_0.03044087067246437"&gt;&lt;span&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;# obtain dimensions of data&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;(rdim, cdim) = get_dimensions("X.csv")&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;br /&gt;&lt;span&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;# allocate a lil_matrix of size (rdim by cdim)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;# note: lil_matrix is used since we will be modifying&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;#       the matrix a lot.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; 
font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;S = lil_matrix((rdim, cdim))&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;# add data to S&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;for (i,j,d) in extract_nonzero("X.csv"):&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;  S[i,j] = d&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;# perform clustering&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; font-weight: normal; "&gt;labeler = KMeans(k=3)&lt;/span&gt;&lt;br /&gt;&lt;span&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;# convert lil to csr format&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;# note: Kmeans currently only works with CSR type sparse matrix&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; 
vertical-align: baseline; white-space: pre-wrap; "&gt;labeler.fit(S.tocsr()) &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;# print cluster assignments for each row&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;for (row, label) in enumerate(labeler.labels_):&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 15px; font-family: Arial; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;  print "row %d has label %d"%(row, label)&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-family: Arial; font-size: 15px; white-space: pre-wrap; "&gt;--------------------------------------------------&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-family: Arial; font-size: 15px; white-space: pre-wrap; "&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-family: Arial; font-size: 15px; white-space: pre-wrap; "&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;span style="white-space: pre-wrap; "&gt;&lt;b&gt;What to do when Sparse Matrices aren't supported:&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;span style="font-size: 15px; white-space: pre-wrap;"&gt;When sparse matrices aren't supported, one solution is to convert the matrix to a full matrix. 
To do this, invoke the sparse matrix's todense() method; this is only practical when the full matrix fits in memory.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-3723252975151234241?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/3723252975151234241/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=3723252975151234241' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/3723252975151234241'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/3723252975151234241'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2012/01/clustering-of-sparse-data-using-python.html' title='Clustering of sparse data using python with scikit-learn'/><author><name>Tony</name><uri>http://www.blogger.com/profile/03721589851872226027</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-7057680909116452373</id><published>2012-01-13T12:04:00.000-08:00</published><updated>2012-01-13T12:04:00.091-08:00</updated><title type='text'>Hudson/Jenkins With RVM and PhantomJS</title><content type='html'>Setting up Hudson/Jenkins to work with &lt;a href="http://beginrescueend.com/" target="_blank"&gt;RVM&lt;/a&gt; (Ruby Version Manager) and &lt;a href="http://www.phantomjs.org/" target="_blank"&gt;PhantomJS&lt;/a&gt;  (for headless JavaScript testing) can be painful. 
This post will show you how to easily set them up on your own server.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;RVM&lt;/span&gt; &lt;br /&gt;At &lt;a href="http://bizo.com/" target="_blank"&gt;Bizo&lt;/a&gt; we have several projects that have dependencies on different versions of Ruby, mostly due to some projects relying on older gems which are incompatible with Ruby 1.9. Installing RVM on a dev machine is almost always a cinch, but getting it to play nicely with your CI build server isn't quite so straightforward. &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;We run our Hudson server off of an Amazon EC2 instance. Our EC2 instances are started up with custom software, but it really boils down to executing a bash startup script. Assuming the Hudson user's $HOME is set to /var/lib/hudson, you can copy/paste the code below to install RVM. Otherwise, just replace /var/lib/hudson below with the $HOME directory of your Hudson (or Jenkins) user.&lt;br /&gt;&lt;pre&gt;&lt;code&gt;&lt;br /&gt;&lt;br /&gt;# RVM&lt;br /&gt;COMMANDS=$(cat &amp;lt;&amp;lt;EOS&lt;br /&gt;bash -s stable &amp;lt; &amp;lt;(curl -s https://raw.github.com/wayneeseguin/rvm/master/binscripts/rvm-installer)&lt;br /&gt;echo "[ -s \"/var/lib/hudson/.rvm/scripts/rvm\" ] &amp;amp;&amp;amp; source \"/var/lib/hudson/.rvm/scripts/rvm\" # loads RVM" &amp;gt; .bashrc&lt;/code&gt;&lt;/pre&gt;&lt;pre&gt;&lt;code&gt;&amp;nbsp;&lt;/code&gt;&lt;/pre&gt;&lt;pre&gt;&lt;code&gt;# ensure RVM is loaded&lt;br /&gt;source ~/.bashrc&lt;br /&gt;&lt;br /&gt;echo "Installing Ruby 1.9.2"&lt;br /&gt;rvm install 1.9.2&amp;nbsp;&lt;/code&gt;&lt;/pre&gt;&lt;pre&gt;&lt;code&gt;rvm use 1.9.2&lt;/code&gt;&lt;/pre&gt;&lt;pre&gt;&lt;code&gt;&lt;br /&gt;echo "Installing gems for Ruby: 1.9.2"&lt;br /&gt;gem install bundler --no-rdoc --no-ri&lt;/code&gt;&lt;/pre&gt;&lt;pre&gt;&lt;code&gt;&amp;nbsp;&lt;/code&gt;&lt;code&gt;&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;pre&gt;&lt;code&gt;# add additional ruby versions 
here&lt;/code&gt;&lt;/pre&gt;&lt;pre&gt;&lt;code&gt;&lt;/code&gt;&lt;code&gt;&lt;/code&gt;&lt;/pre&gt;&lt;pre&gt;&lt;code&gt;&lt;/code&gt;&lt;code&gt;EOS&lt;br /&gt;)&lt;br /&gt;su - hudson -c "$COMMANDS"&lt;/code&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Then, in your Hudson build, go to your project configuration, and under "execute shell" you can invoke rvm and run your project like normal. Note: our version of Hudson doesn't automatically load .bashrc, so you might need to source it first to ensure RVM loads, e.g.: &lt;br /&gt;&lt;pre&gt;&lt;code&gt;&lt;br /&gt;source ~/.bashrc&lt;/code&gt;&lt;/pre&gt;&lt;pre&gt;&lt;code&gt;&amp;nbsp;&lt;/code&gt;&lt;/pre&gt;&lt;pre&gt;&lt;code&gt;# Pick our ruby version&lt;br /&gt;rvm use 1.9.2&lt;/code&gt;&lt;/pre&gt;&lt;pre&gt;&lt;code&gt;&lt;br /&gt;# Run your project... e.g. bundle install &amp;amp;&amp;amp; rake test:units for a Rails project&amp;nbsp;&lt;/code&gt;&lt;/pre&gt;&lt;pre&gt;&lt;code&gt;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;/pre&gt;&lt;span style="font-size: large;"&gt;PhantomJS&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: small;"&gt;&lt;a href="http://www.phantomjs.org/" target="_blank"&gt;PhantomJS&lt;/a&gt; is our execution environment of choice for running JavaScript unit tests, and setting it up to run on Hudson is actually quite easy.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: small;"&gt;Here is the necessary bash snippet to make it available for use in Hudson.&amp;nbsp; &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;code&gt;&lt;br /&gt;INSTALL_PATH= # wherever you want&lt;br /&gt;# -O names the output file; without it wget treats the second argument as another URL&lt;br /&gt;wget -O ${INSTALL_PATH}/phantomjs.tar.gz http://phantomjs.googlecode.com/files/phantomjs-1.4.1-linux-x86-dynamic.tar.gz&lt;br /&gt;# OR for 64-bit machines: wget -O ${INSTALL_PATH}/phantomjs.tar.gz http://phantomjs.googlecode.com/files/phantomjs-1.4.1-linux-x86_64-dynamic.tar.gz&lt;br /&gt;&lt;br /&gt;mkdir ${INSTALL_PATH}/phantomjs&lt;br /&gt;tar -zxvf ${INSTALL_PATH}/phantomjs.tar.gz -C ${INSTALL_PATH}&lt;br /&gt;&lt;br /&gt;ln -s 
${INSTALL_PATH}/phantomjs/bin/phantomjs /usr/local/bin/phantomjs&lt;br /&gt;&lt;/code&gt;&amp;nbsp;&lt;br /&gt;&lt;/pre&gt;After running the script above, you can invoke PhantomJS as "phantomjs" in the "execute shell" box inside your project configuration. You'll probably want failing tests to exit with a non-zero status so that the Hudson build fails. If you use the &lt;a href="http://pivotal.github.com/jasmine/" target="_blank"&gt;Jasmine&lt;/a&gt; testing framework, you can use our phantom-jasmine test runner on GitHub: &lt;a href="https://github.com/jcarver989/phantom-jasmine" target="_blank"&gt;https://github.com/jcarver989/phantom-jasmine&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-7057680909116452373?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/7057680909116452373/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=7057680909116452373' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7057680909116452373'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7057680909116452373'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2012/01/hudsonjenkins-with-rvm-and-phantomjs.html' title='Hudson/Jenkins With RVM and PhantomJS'/><author><name>Josh Carver</name><uri>http://www.blogger.com/profile/15167764329841650102</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-7773138442352997997</id><published>2012-01-13T10:56:00.000-08:00</published><updated>2012-01-13T10:56:29.378-08:00</updated><title type='text'>Interactive Hive sessions, Elastic MapReduce, and GNU screen</title><content type='html'>One extremely annoying quality of using Hive interactively on EMR (or any other remote system) is that your sessions will die if you lose your connection to the server. &amp;nbsp;Once this happens, your ssh session will end, terminating both your Hive session and any queries that may currently be running.&lt;br /&gt;&lt;br /&gt;In most cases, this happens when I'm waiting for a query to execute and I need to move from one place to another, whether from my desk to a conference room or from the office to home. &amp;nbsp;When I can predict (or know) that I'm going to lose my connection and just want to be able to reconnect to Hive later, the best option I've found is to run Hive inside of &lt;a href="http://www.gnu.org/software/screen/"&gt;GNU screen&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I'm definitely a screen newbie, but there are really only three things you need to know:&lt;br /&gt;&lt;br /&gt;1. As soon as you log in for the first time, install screen and start it up:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;sudo apt-get install screen&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;screen&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;hive&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;2. 
When you're (temporarily) done interacting with Hive and want to stick it in the background, tell screen to detach Hive from your current session by pressing "&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;Ctrl-a&lt;/span&gt;" then "&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;d&lt;/span&gt;". &amp;nbsp;You may now log out from your EMR node.&lt;br /&gt;&lt;br /&gt;3. When you're ready to resume your Hive session, simply log back on and tell screen to reconnect to the most recent session:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;screen -r&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Screen does a whole lot of other stuff, but simply allowing graceful reconnection to Hive sessions is definitely worth the price of entry.&lt;br /&gt;&lt;br /&gt;For comparison, some other things you could do to work around this problem are using &lt;a href="http://en.wikipedia.org/wiki/Nohup"&gt;nohup&lt;/a&gt;, suspending/putting the job in the background and using disown, or using an even more advanced tool like &amp;nbsp;&lt;a href="http://tmux.sourceforge.net/"&gt;tmux&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-7773138442352997997?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/7773138442352997997/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=7773138442352997997' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7773138442352997997'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7773138442352997997'/><link rel='alternate' type='text/html' 
href='http://dev.bizo.com/2012/01/interactive-hive-sessions-elastic.html' title='Interactive Hive sessions, Elastic MapReduce, and GNU screen'/><author><name>Darren Lee</name><uri>https://profiles.google.com/107568415606207989360</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh3.googleusercontent.com/-hYxqr8C5FwI/AAAAAAAAAAI/AAAAAAAAAB4/g_fQsD_uYcI/s512-c/photo.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-8620564667072508386</id><published>2012-01-09T05:08:00.000-08:00</published><updated>2012-01-09T05:08:01.440-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='cloudwatch'/><category scheme='http://www.blogger.com/atom/ns#' term='aws'/><title type='text'>How to Measure Latency Distribution using Amazon CloudWatch</title><content type='html'>By default, web services hosted in AWS behind an Elastic Load Balancer have their response latencies automatically tracked in CloudWatch. &amp;nbsp;This lets you easily monitor the minimum, maximum, and average latency of your services.&lt;br /&gt;&lt;br /&gt;Unfortunately, none of these metrics are very useful. &amp;nbsp;Minimum and maximum latencies really measure outliers, and averages can easily obscure what's really going on in your services.&lt;br /&gt;&lt;br /&gt;To get a better picture of the performance of our backend services, we explicitly have client services track the latency distribution of servers using CloudWatch custom metrics. &amp;nbsp;Instead of having a single metric that measures the number of milliseconds taken by each request, we instead count the number of requests that take a particular number of milliseconds.&lt;br /&gt;&lt;br /&gt;More specifically, for each service, we create eleven custom metrics representing the buckets 0-10ms, 11-20ms, 21-30ms, ... , 91-100ms, and 101+ms. 
&amp;nbsp;(We actually have another set of buckets covering the 100-1000ms range at 100ms intervals, but we've found these to be less useful.) &amp;nbsp;For each request, we then increment the counter for the bucket corresponding to the request's latency. &amp;nbsp;Periodically, an automated process simply pulls the aggregated counts from CloudWatch into a Google Spreadsheet and graphs the results in a stacked bar chart.&lt;br /&gt;&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-UCCLjyg-6Hk/TwfCpe-6sBI/AAAAAAAAAA0/Hhg2hvju9SA/s1600/Screen+Shot+2012-01-06+at+7.56.03+PM.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="276" src="http://1.bp.blogspot.com/-UCCLjyg-6Hk/TwfCpe-6sBI/AAAAAAAAAA0/Hhg2hvju9SA/s640/Screen+Shot+2012-01-06+at+7.56.03+PM.png" width="640" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;A stacked bar chart of response times for one of our services. 
&amp;nbsp;Each layer represents a different response time bucket.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;This gives us a nice visual representation of both the overall traffic level and how many requests had response times below each threshold.&lt;br /&gt;&lt;br /&gt;Our automated process also converts the totals into percentiles.&lt;br /&gt;&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-39pHt77ebE4/TwfD0MESCQI/AAAAAAAAAA8/sh5L-HG93Vo/s1600/Screen+Shot+2012-01-06+at+8.02.01+PM.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="256" src="http://2.bp.blogspot.com/-39pHt77ebE4/TwfD0MESCQI/AAAAAAAAAA8/sh5L-HG93Vo/s640/Screen+Shot+2012-01-06+at+8.02.01+PM.png" width="640" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;A stacked bar chart of response times as percentiles.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;Google's chart tools support displaying only a vertical slice of the data, so we can easily show 90th, 95th, and 97.5th percentiles of our response times. 
&amp;nbsp;This lets us easily see whether we're fulfilling performance requirements like "95% of all requests with response times below N ms."&lt;br /&gt;&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-Nif_xxCJZOw/TwfEtUMFlMI/AAAAAAAAABE/d1KwLuxjn68/s1600/Screen+Shot+2012-01-06+at+8.05.52+PM.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="268" src="http://2.bp.blogspot.com/-Nif_xxCJZOw/TwfEtUMFlMI/AAAAAAAAABE/d1KwLuxjn68/s640/Screen+Shot+2012-01-06+at+8.05.52+PM.png" width="640" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Our slowest responses.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;The combination of latency bucketing, Amazon CloudWatch, and Google Spreadsheets gives us a very lightweight way of tracking our server performance. 
&amp;nbsp;The only additional overhead on our servers is a bit of logic to do local aggregation and push data into CloudWatch, and the only other moving part is a simple cron task that connects together the CloudWatch and Google Data APIs.&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-8620564667072508386?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/8620564667072508386/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=8620564667072508386' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/8620564667072508386'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/8620564667072508386'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2012/01/how-to-measure-latency-distribution.html' title='How to Measure Latency Distribution using Amazon CloudWatch'/><author><name>Darren Lee</name><uri>https://profiles.google.com/107568415606207989360</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh3.googleusercontent.com/-hYxqr8C5FwI/AAAAAAAAAAI/AAAAAAAAAB4/g_fQsD_uYcI/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-UCCLjyg-6Hk/TwfCpe-6sBI/AAAAAAAAAA0/Hhg2hvju9SA/s72-c/Screen+Shot+2012-01-06+at+7.56.03+PM.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-2135277626725753624</id><published>2011-12-19T13:54:00.000-08:00</published><updated>2011-12-19T13:59:24.755-08:00</updated><title type='text'>Get Your 
Company To Blog More With A Game</title><content type='html'>&lt;p&gt;At Bizo we try to blog fairly often. But writing blog posts with any degree of frequency at a startup is tough - there are often ten other important tasks that needed to be done yesterday. Finding the time to sit down and write a post when you have features to build, code to review and the occasional meeting is difficult, to say the least.&lt;/p&gt;&lt;p&gt;We needed something that would encourage us to blog more frequently, and there's no better way to motivate a bunch of engineers than a game. So during our last Open Source Day we built a Blog Scoreboard that ranks authors based on the number of posts and comments they have. It's set up on our office big-screen TV, serving as a constant reminder that &lt;a href="http://twitter.com/#!/ogrodnek/"&gt;Larry&lt;/a&gt; is, by a landslide, the king of blogging. For a live demo of the scoreboard, and to see just how much Larry is dominating us, check out &lt;a href="http://blog-leaderboard.heroku.com"&gt;the live example here&lt;/a&gt;.&lt;/p&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-gUuKvYYu5OY/Tu-vmkc4hVI/AAAAAAAAAJM/gZ4WgMTMzPI/s1600/blog-leaderboard.jpg" imageanchor="1"&gt;&lt;img border="0" height="163" src="http://4.bp.blogspot.com/-gUuKvYYu5OY/Tu-vmkc4hVI/AAAAAAAAAJM/gZ4WgMTMzPI/s320/blog-leaderboard.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;For now we just optimize for the volume of posts, so points are assigned as follows: 10 points per blog post and 1 point per comment. The weighting scheme is intentionally naive, since this had to be built in a few hours, and will undoubtedly change as time goes on.&lt;/p&gt;&lt;p&gt;The best part about our scoreboard is that it is an open source Sinatra (Ruby) app, and it works with any Blogger blog! All you have to do is edit a few lines of YAML and you'll have your very own big-screen blog scoreboard. 
You can grab the source code and install instructions on github: &lt;a href="https://github.com/jcarver989/blog-scoreboard"&gt;https://github.com/jcarver989/blog-scoreboard&lt;/a&gt; Happy blogging!&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-2135277626725753624?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/2135277626725753624/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=2135277626725753624' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/2135277626725753624'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/2135277626725753624'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2011/12/get-your-company-to-blog-more-with-game.html' title='Get Your Company To Blog More With A Game'/><author><name>Josh Carver</name><uri>http://www.blogger.com/profile/15167764329841650102</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-gUuKvYYu5OY/Tu-vmkc4hVI/AAAAAAAAAJM/gZ4WgMTMzPI/s72-c/blog-leaderboard.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-5528854945084445195</id><published>2011-12-15T13:58:00.000-08:00</published><updated>2011-12-15T14:40:50.302-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hive'/><category scheme='http://www.blogger.com/atom/ns#' term='emr'/><category 
scheme='http://www.blogger.com/atom/ns#' term='aws'/><title type='text'>4 tips from the trenches of Amazon Elastic MapReduce and Hive</title><content type='html'>Some things you can learn from others, and some things you can only learn from experience. &amp;nbsp;In an effort to move some knowledge from the latter category into the former, here are four things we've learned the hard way while working with big data flows using Hive on Amazon's Elastic MapReduce:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;1. To prevent data loss, make sure Hive knows if it owns the data.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;All of our data processing follows a simple pattern: servers write logs to S3, and we use the basic EMR/S3 integration to read this data in our Hive scripts. &amp;nbsp;A typical table definition could look something like this:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;create external table sample_data(d map&amp;lt;string,string&amp;gt;)&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; comment 'data logs uploaded from servers'&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; row format delimited&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; fields terminated by '\004'&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; collection items terminated by '\001'&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; map keys 
terminated by '\002'&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; stored as textfile&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; location 's3://log-bucket/hive/data/files/sample_data/'&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;;&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;(The single map column is our way of supporting &lt;a href="http://dev.bizo.com/2011/02/columns-in-hive.html"&gt;dynamic columns in Hive&lt;/a&gt;.)&lt;br /&gt;&lt;br /&gt;The most easily overlooked part of this table definition is the "&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;external&lt;/span&gt;" keyword, which tells Hive that the actual underlying data is managed by some other process. &amp;nbsp;If you forget to add this keyword and later issue a "&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;drop table&lt;/span&gt;" command, Hive will happily nuke all of your log files.&lt;br /&gt;&lt;br /&gt;This can be especially troublesome during ad hoc analysis, which usually means running interactive queries in the Hive console in an exploratory style that includes deleting and recreating tables.&lt;br /&gt;&lt;br /&gt;One common pattern where table drops appear is in scripts that include a create statement. 
&amp;nbsp;Normally, trying to recreate a table that already exists in Hive will cause an error, so a script may preemptively issue a drop command in case that table already exists.&amp;nbsp; An alternative is to change the table definition to tell Hive to ignore create statements for preexisting tables:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;create external table if not exists sample_data(d map&amp;lt;string,string&amp;gt;)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In the event that you accidentally forget to specify "&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;external&lt;/span&gt;" in your table definition, you can add it later by altering the table:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;alter table sample_data set tblproperties ('EXTERNAL'='TRUE') ;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Note that the capitalization in the table properties is significant. &amp;nbsp;Additionally, this feature is only available in Hive 0.6.0+. &amp;nbsp;(It was present but &lt;a href="https://issues.apache.org/jira/browse/HIVE-1329"&gt;buggy&lt;/a&gt; prior to 0.6.0.)&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;2. Use a smart partitioning scheme to create flexible views of the data.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Hive supports partitioning, which can be a huge performance win when querying only a subset of the entire dataset. &amp;nbsp;This is critical when the entire dataset consists of years of request information but you're only interested in analyzing one day of traffic. 
&amp;nbsp;To take advantage of this, we upload our data into hourly paths in S3:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;s3://log-bucket/log-name/year=${year}/month=${month}/day=${day}/hour=${hour}/${logfile}.log.gz&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;We would then load this data with the following statements:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;create external table sample_data(d map&amp;lt;string,string&amp;gt;)&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; comment 'data logs uploaded from servers'&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; partitioned by (&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; year string,&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; month string,&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; day string,&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; hour string)&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; row format delimited&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: 
x-small;"&gt;&amp;nbsp; &amp;nbsp; fields terminated by '\004'&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; collection items terminated by '\001'&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; map keys terminated by '\002'&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; stored as textfile&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; location 's3://log-bucket/log-name/'&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;alter table sample_data recover partitions ;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The year/month/day/hour partitions are now available to be used in select/where statements just like columns.&lt;br /&gt;&lt;br /&gt;We could (and originally did) just use a single partition column for the entire date; however, using multiple partition columns allows Hive to completely ignore the presence of other partitions. &amp;nbsp;The above statement will need to recover metadata for 8-9 thousand partitions of data, which (while less troublesome than not partitioning at all) will still require a lot of time and memory. 
&amp;nbsp;Multiple partition columns let us create (for example) a "view" of a single month:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;create external table sample_data_2011_12(d map&amp;lt;string,string&amp;gt;)&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; comment 'data logs uploaded from servers'&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; partitioned by (&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; day string,&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; hour string)&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; row format delimited&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; fields terminated by '\004'&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; collection items terminated by '\001'&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; map keys terminated by '\002'&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; stored as 
textfile&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; location 's3://log-bucket/log-name/year=2011/month=12/'&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;alter table sample_data_2011_12 recover partitions ;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Now Hive only needs to recover 7-8 hundred partitions. &amp;nbsp;We use a similar strategy for our daily reporting, which makes recovering partition data even faster.&lt;br /&gt;&lt;br /&gt;The only time this scheme breaks down is when the report boundaries don't align with the partition boundaries. &amp;nbsp;In these cases, you can still get the benefits of partitioning by manually adding the partition information to the table. &amp;nbsp;For example, to do efficient queries on a week of data, we would replace the "recover partitions" statement with a sequence of statements like these (and tweak the table definition to use only a single partition column):&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;alter table sample_data add partition(day_hour=2011120600) location "s3://log-bucket/log-name/year=2011/month=12/day=06/hour=00/";&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;3. Use S3 multipart uploads for large objects.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;While S3, Elastic MapReduce, and Hive theoretically allow you to easily scale your data storage and analytics, we regularly run up against operational limits as our processing requirements grow. 
&amp;nbsp;One surprising problem we recently ran into was S3 throttling EMR because our cluster was accessing the source data too quickly. &amp;nbsp;After some back-and-forth with support, they suggested a workaround of uploading the source objects (which were multiple GB in size and created by a previous EMR job) with multi-part uploads.&lt;br /&gt;&lt;br /&gt;Enabling multi-part uploads in EMR is simply a matter of &lt;a href="http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/index.html?Config_Multipart.html"&gt;flipping a configuration switch&lt;/a&gt; in the cluster configuration. &amp;nbsp;When starting a cluster from the command line, simply add the following options (taken from the linked documentation):&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;elastic-mapreduce --create --alive \&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;--bootstrap-name "enable multipart upload" \&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;--args "-c,fs.s3n.multipart.uploads.enabled=true, \&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;-c,fs.s3n.multipart.uploads.split.size=524288000"&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;4. Spot instances can dramatically reduce your data costs.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;As our data processing requirements have grown, so have our cluster sizes. 
&amp;nbsp;We've found that we can reduce our costs by over 50% by using a mix of on-demand and spot instances.&lt;br /&gt;&lt;br /&gt;Each Elastic MapReduce job can have up to three instance groups: master, core, and task. &amp;nbsp;All data resides on the master and core instances, so jobs can use spot instances to add processing power and additional IO with very little risk. &amp;nbsp;You can add these instances at startup by simply specifying a bid price:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;elastic-mapreduce --create --alive --hive-interactive --hive-versions 0.7 \&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;--instance-group master --instance-type m1.xlarge --instance-count 1 \&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;--instance-group core --instance-type m1.xlarge --instance-count 9 \&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;--instance-group task --instance-type m1.xlarge --instance-count 90 --bid-price 0.25&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;For ad hoc queries, I've also found it easy to prototype my job on a subset of data using only the master and core instances, then add the task instances once I'm satisfied that the query is correct. &amp;nbsp;New task instances are available as Hadoop workers as soon as they're added to the cluster, even if you're already running your MapReduce job. 
&amp;nbsp;You can modify the size of the task group using the command-line client:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;elastic-mapreduce --jobflow j-xxxxxxxxxxxxx --add-instance-group task &amp;nbsp;--instance-type m1.xlarge --instance-count 90 --bid-price 0.25&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;elastic-mapreduce --jobflow j-xxxxxxxxxxxxx --set-num-task-group-instances 90&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Note that while the size of the task group can be changed, you cannot modify the instance type or the bid price after the task group is created.&lt;/div&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-5528854945084445195?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/5528854945084445195/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=5528854945084445195' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/5528854945084445195'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/5528854945084445195'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2011/12/4-tips-from-trenches-of-amazon-elastic.html' title='4 tips from the trenches of Amazon Elastic MapReduce and Hive'/><author><name>Darren Lee</name><uri>https://profiles.google.com/107568415606207989360</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='//lh3.googleusercontent.com/-hYxqr8C5FwI/AAAAAAAAAAI/AAAAAAAAAB4/g_fQsD_uYcI/s512-c/photo.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-7026112140242556504</id><published>2011-12-13T11:55:00.000-08:00</published><updated>2011-12-13T11:55:54.835-08:00</updated><title type='text'>Promises in Javascript/Coffeescript</title><content type='html'>&lt;p&gt;This happens often. You’re humming along writing some awesome Javascript code. At first everything is neat and organized. Then you add a feature here, an AJAX call there, and before you know it your once lovely codebase has turned into callback spaghetti.&lt;/p&gt;&lt;p&gt;One scenario where this can become particularly nasty is when you have an arbitrary number of asynchronous tasks, one of which can’t run until all the others have completed. Suppose, for example, you are creating a page that displays information on your organization’s engineering team and open source projects hosted on Github. This might require a few api calls to Github (e.g. one to get the repositories, another to get the team members) that can be executed in parallel. We could just append the information to the DOM as each api call finishes, but it would be nice to avoid pieces of the site popping in at different times. Ideally, we would make the api calls to Github and render the information to the DOM only after both calls had completed. The problem is that in Javascript trying to implement this can get quite messy.&lt;/p&gt;&lt;h3&gt;A Bad Solution&lt;/h3&gt;&lt;p&gt;The issue here is knowing when it is safe to execute the task that renders the Github information to the page. 
One strategy might be to store the results from each api call in an array and then wait for a set amount of time before drawing the page.&lt;/p&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;results = []&lt;br /&gt;&lt;br /&gt;githubApiCallOne (response) -&gt; &lt;br /&gt;  results.push(response)&lt;br /&gt;&lt;br /&gt;githubApiCallTwo (response) -&gt;&lt;br /&gt;  results.push(response)&lt;br /&gt;&lt;br /&gt;setTimeout(() -&gt;&lt;br /&gt;  drawPage(results)&lt;br /&gt;, 5000)&lt;br /&gt;&lt;/pre&gt;&lt;p&gt;This turns out to be a poor solution as we can either end up waiting too little, in which case the program would fail or we could set the timeout too high making our users wait longer than necessary to load our page.&lt;/p&gt;&lt;h3&gt;Promises to The Rescue&lt;/h3&gt;&lt;p&gt;Fortunately there is a better way to implement our Github page. Using a construct called a Promise (sometimes called a Future) allows us to elegantly handle these types of situations. Using promises we can turn our code into something like this:&lt;/p&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;Promise.when(&lt;br /&gt;  githubApiCallOne(),&lt;br /&gt;  githubApiCallTwo()&lt;br /&gt;).then((apiCallOneData, apiCallTwoData) -&gt;&lt;br /&gt;  renderPage(apiCallOneData, apiCallTwoData)&lt;br /&gt;)&lt;br /&gt;&lt;/pre&gt;&lt;p&gt;The basic idea is that our async api calls will now return a promise object that functions much like an IOU – they can’t give us the results of the api call immediately but they (probably) can at some time in the future. The Promise.when method takes an arbitrary number of promise objects as parameters and then executes the callback in the “then” method once every promise passed to “when” has been completed.&lt;/p&gt;&lt;p&gt;To do this, our api calls would have to be modified to return promise objects, which turns out to be trivial. 
Such an implementation might look like so:&lt;/p&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;githubApiCallOne = () -&gt;&lt;br /&gt;  promise = new Promise()&lt;br /&gt;&lt;br /&gt;  # async call&lt;br /&gt;  ajaxGet("/repositories", (repository_data) -&gt;&lt;br /&gt;    # fulfill the promise when async call completes&lt;br /&gt;    promise.complete(repository_data)&lt;br /&gt;  )&lt;br /&gt;&lt;br /&gt;  return promise&lt;br /&gt;&lt;br /&gt;githubApiCallTwo = () -&gt;&lt;br /&gt;  promise = new Promise()&lt;br /&gt;&lt;br /&gt;  ajaxGet("/users", (user_data) -&gt;&lt;br /&gt;    promise.complete(user_data)&lt;br /&gt;  )&lt;br /&gt;&lt;br /&gt;  return promise&lt;br /&gt;&lt;/pre&gt;&lt;p&gt;The githubApiCallOne and githubApiCallTwo functions make their AJAX calls but return a promise object immediately. Then when the AJAX calls complete, they can fulfill the promise objects by calling “complete” and passing in their data. Once both promise objects have been fulfilled the callback passed to “then” is executed and we render the page.&lt;/p&gt;&lt;h3&gt;With jQuery&lt;/h3&gt;&lt;p&gt;The good news is if you’re already using jQuery you get Promises for free. As of jQuery 1.5 all the $.ajax methods (e.g. $.get, $.post etc.) return promises, which allows you to do this:&lt;/p&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;promise1 = $.get "http://foo.com"&lt;br /&gt;promise2 = $.post "http://boo.com"&lt;br /&gt;&lt;br /&gt;$.when(promise1, promise2)&lt;br /&gt; .then (promise1Result, promise2Result) -&gt;&lt;br /&gt;  # do something with the data&lt;br /&gt;&lt;/pre&gt;&lt;h3&gt;What if I can’t use jQuery?&lt;/h3&gt;&lt;p&gt;Rolling a custom implementation of Promises isn’t recommended for production code but might be necessary if you write a lot of 3rd party Javascript and/or just want to try it for fun. Here’s a very basic implementation to get you started. 
Error handling, exceptions etc are left as an exercise to the reader.&lt;/p&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;class Promise&lt;br /&gt;  @when: (tasks...) -&gt;&lt;br /&gt;    num_uncompleted = tasks.length &lt;br /&gt;    args = new Array(num_uncompleted)&lt;br /&gt;    promise = new Promise()&lt;br /&gt;&lt;br /&gt;    for task, task_id in tasks&lt;br /&gt;      ((task_id) -&gt;&lt;br /&gt;        task.then(() -&gt;&lt;br /&gt;          args[task_id] = Array.prototype.slice.call(arguments)&lt;br /&gt;          num_uncompleted--&lt;br /&gt;          promise.complete.apply(promise, args) if num_uncompleted == 0&lt;br /&gt;        )&lt;br /&gt;      )(task_id)&lt;br /&gt;&lt;br /&gt;    return promise&lt;br /&gt;    &lt;br /&gt;  constructor: () -&gt;&lt;br /&gt;    @completed = false&lt;br /&gt;    @callbacks = []&lt;br /&gt;&lt;br /&gt;  complete: () -&gt;&lt;br /&gt;    @completed = true&lt;br /&gt;    @data = arguments&lt;br /&gt;    for callback in @callbacks&lt;br /&gt;      callback.apply callback, arguments&lt;br /&gt;&lt;br /&gt;  then: (callback) -&gt;&lt;br /&gt;    if @completed == true&lt;br /&gt;      callback.apply callback, @data&lt;br /&gt;      return&lt;br /&gt;&lt;br /&gt;    @callbacks.push callback&lt;br /&gt;&lt;/pre&gt;&lt;p&gt;Sharp eyed readers might notice that the code inside the for loop in the Promise.when method looks a bit strange. You might notice that I’m wrapping the promise’s “then” method call inside of a self executing function that passes in the task_id variable. This funkiness is actually required due to the way that closures work in Javascript. If you attempt to reference the task_id without the self executing closure, you’ll actually get a reference to the task_id iterator instead of a copy – which means by the time your “then” methods execute the loop will have finished iterating and all the task_ids will share the same value! 
To get around this you have to create a new scope and pass in the iterator so we end up with a copy of the value instead of a reference.&lt;/p&gt;&lt;p&gt;And finally, an example using the supplied Promise class to prove it works:&lt;/p&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;delay = (string) -&gt;&lt;br /&gt;  promise = new Promise()&lt;br /&gt;  setTimeout(() -&gt; &lt;br /&gt;    promise.complete string&lt;br /&gt;  ,200)&lt;br /&gt;  return promise&lt;br /&gt;&lt;br /&gt;logEverything = (fooData, barData, bazData) -&gt; &lt;br /&gt;  console.log fooData[0], barData[0], bazData[0]&lt;br /&gt;&lt;br /&gt;window.onload = () -&gt;&lt;br /&gt;  Promise.when(&lt;br /&gt;    delay("foo"),&lt;br /&gt;    delay("bar"),&lt;br /&gt;    delay("baz")&lt;br /&gt;  ).then logEverything&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-7026112140242556504?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/7026112140242556504/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=7026112140242556504' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7026112140242556504'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7026112140242556504'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2011/12/promises-in-javascriptcoffeescript.html' title='Promises in Javascript/Coffeescript'/><author><name>Josh Carver</name><uri>http://www.blogger.com/profile/15167764329841650102</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-573946188593724482</id><published>2011-12-08T15:53:00.001-08:00</published><updated>2011-12-08T16:37:36.653-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='iam'/><category scheme='http://www.blogger.com/atom/ns#' term='aws'/><title type='text'>IAM for DevOps Organizations</title><content type='html'>I gave a talk last month at the &lt;a href="http://www.advancedaws.org/"&gt;Advanced AWS Meetup&lt;/a&gt; on how we're using &lt;a href="http://aws.amazon.com/iam/"&gt;IAM&lt;/a&gt;.&lt;a href="http://www.advancedaws.org/meetup_assets/33538982/IAM_at_Bizo.pdf"&gt;&lt;img src="http://com-bizo-public.s3.amazonaws.com/blog/misc/images/iam-bizo.png"&gt;&lt;/a&gt;Slides are &lt;a href="http://www.advancedaws.org/meetup_assets/33538982/IAM_at_Bizo.pdf"&gt;available here&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-573946188593724482?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/573946188593724482/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=573946188593724482' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/573946188593724482'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/573946188593724482'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2011/12/iam-for-devops-organizations.html' title='IAM for DevOps Organizations'/><author><name>larry 
ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-7931824801535594096</id><published>2011-11-21T14:39:00.001-08:00</published><updated>2011-11-21T15:37:42.492-08:00</updated><title type='text'>SVG Charts, Done Right.</title><content type='html'>&lt;p&gt;Here at Bizo, like many companies, we use chart visualizations in our products. They are fantastic tools for making the lives of our users easier. But unfortunately most charting libraries aren't that great. Some, like Google's visualizations, have overly complex APIs where you have to contort your data into intermediary objects and zero-pad data. Others, like flot, have poor default options that require you to override almost every option just to get something that looks halfway presentable.&lt;/p&gt;&lt;p&gt;Our initial requirements for a charting library seemed pretty simple:&lt;/p&gt;&lt;ol&gt; &lt;li&gt;Good looking defaults&lt;/li&gt; &lt;li&gt;An API that didn't force a developer to do mental gymnastics just to load in data&lt;/li&gt; &lt;li&gt;IE support&lt;/li&gt; &lt;li&gt;No flash&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;But after looking around for a while it became apparent that none of the existing libraries really met our needs. Every library we looked at fell flat on at least one of these four requirements. So I set out to write a charting library from scratch during nights and weekends. Eleven days later, version 1.0 of Raphy Charts was born. 
Raphy Charts (&lt;a href="http://jcarver989.github.com/raphy-charts/"&gt;http://jcarver989.github.com/raphy-charts/&lt;/a&gt;) is an SVG charting library built on top of Raphael (&lt;a href="http://raphaeljs.com/"&gt;http://raphaeljs.com/&lt;/a&gt;) that includes the following features:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Great looking defaults (see example below)&lt;/li&gt;&lt;li&gt;Easy API that allows you to pass a normal 2d Javascript array of x,y points without the need to pad your x-labels.&lt;/li&gt;&lt;li&gt;IE support for version 7+&lt;/li&gt;&lt;li&gt;SVG or VML (older IEs) charts with no Flash.&lt;/li&gt;&lt;/ul&gt;&lt;a href="http://jcarver989.github.com/raphy-charts/"&gt;Get Raphy Charts&lt;/a&gt;&lt;script src="https://raw.github.com/DmitryBaranovskiy/raphael/master/raphael-min.js"&gt;&lt;/script&gt;&lt;script src="https://raw.github.com/jcarver989/raphy-charts/master/compiled/charts.min.js"&gt;&lt;/script&gt;&lt;script type="text/javascript"&gt;(function() {  var create_date, create_exponential_points, create_random_points2, create_squared_points, draw_bars;  create_date = function(day) {    var d;    d = new Date();    return new Date(d.getFullYear(), d.getMonth(), day + 1);  };  create_exponential_points = function() {    var i, _results;    _results = [];    for (i = 0; i &lt;= 25; i++) {      _results.push([create_date(i), i * 4.]);    }    return _results;  };  create_squared_points = function() {    var i, _results;    _results = [];    for (i = 0; i &lt;= 25; i++) {      _results.push([create_date(i), i * (i - 1)]);    }    return _results;  };  create_random_points2 = function() {    var i, _results;    _results = [];    for (i = 0; i &lt;= 25; i++) {      _results.push([create_date(i), Math.random() * i]);    }    return _results;  };  draw_bars = function(r, points) {    var attach_handler, i, point, rect, x, _len, _results;    attach_handler = function(element) {      element.mouseover(function() {        return element.attr({          
"fill": "#333"        });      });      return element.mouseout(function() {        return element.attr({          "fill": "#00aadd"        });      });    };    x = points[0].x;    _results = [];    for (i = 0, _len = points.length; i &lt; _len; i++) {      point = points[i];      rect = r.rect(x - 15, point.y, 15, 300 - point.y);      x += 16;      rect.attr({        "fill": "#00aadd",        "stroke": "transparent",        "stroke-width": "0"      });      attach_handler(rect);      _results.push(new Charts.Tooltip(r, rect, Math.floor(points[i].y)));    }    return _results;  };  window.onload = function() {    var c, charts, colors, conversions_chart, data, data_pair, i, result, signups_chart, sparkline_options, visitors_chart, x, y, yvals, _i, _len, _len2, _ref, _ref2;    charts = Charts;    c = new Charts.LineChart('chart1', {      show_y_labels: false    });    c.add_line({      data: create_exponential_points(),      options: {        line_color: "#55bb00",        area_color: "#55bb00",        dot_color: "#55bb00"      }    });    c.add_line({      data: create_squared_points()    });    c.draw();  };}).call(this);&lt;/script&gt;&lt;div id="chart1" style="width: 500px; height: 300px;"&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-7931824801535594096?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/7931824801535594096/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=7931824801535594096' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7931824801535594096'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7931824801535594096'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2011/11/svg-charts-done-right.html' title='SVG Charts, Done Right.'/><author><name>Josh Carver</name><uri>http://www.blogger.com/profile/15167764329841650102</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-1706739937190703808</id><published>2011-11-01T18:28:00.001-07:00</published><updated>2011-11-01T18:32:12.583-07:00</updated><title type='text'>5 minute web framework review : reading params</title><content type='html'>&lt;p&gt;Through various experiments, hackdays, conversations with other developers, etc., I've found myself experimenting with a few different web frameworks.  The focus has been mostly simple webapps / simple REST services written in scala that return html or json.  I thought it might be interesting to dive into some focused comparisons in a series of posts.&lt;/p&gt;&lt;p&gt;This is not an exhaustive comparison.  I'm going to be focusing on the frameworks I've found the most interesting for my use cases lately, namely &lt;a href="http://www.scalatra.org/"&gt;scalatra&lt;/a&gt;, &lt;a href="http://scala.playframework.org/"&gt;play&lt;/a&gt;, and &lt;a href="http://wikis.sun.com/display/Jersey/Main"&gt;jersey&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;For the first comparison, I want to focus on reading query and path parameters.  Parameter de-serialization has always been a pain.  The web uses strings, and strings are messy.  Are my required params specified? Are they the right types? Can I easily convert to the types my program expects? Do they pass my validation? 
etc.&lt;/p&gt;&lt;p&gt;Let's look at how each framework helps us deal with these common concerns.&lt;/p&gt;&lt;h1&gt;jersey&lt;/h1&gt;I like &lt;a href="http://wikis.sun.com/display/Jersey/Main"&gt;jersey&lt;/a&gt;.  It's a reference implementation of &lt;a href="http://jsr311.java.net/"&gt;JSR-311: Java API for RESTful Web Services&lt;/a&gt;. It's also quite nice to work with in scala.&lt;h2&gt;query parameters&lt;/h2&gt;&lt;p&gt;With jersey, query parameters are simply specified as method parameters.  Simple types are automatically converted, and it's easy to specify defaults.  It will also automatically call converters for use with your own complex types.  Unfortunately, you must use annotations to map query param names to method params.&lt;/p&gt;&lt;p&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt; def doGet(@QueryParam("name") name: String,&lt;br /&gt;           @QueryParam("count") @DefaultValue("2") count: Int): String = {&lt;br /&gt;   "name: %s, count: %d\n".format(name, count)&lt;br /&gt; }&lt;br /&gt;&lt;/pre&gt;&lt;/p&gt;&lt;p&gt;It works pretty much as you'd expect:&lt;pre class="prettyprint"&gt;&lt;br /&gt;$ curl "http://localhost:8080/hello?name=larry&amp;count=5"&lt;br /&gt;name: larry, count: 5&lt;br /&gt;&lt;/pre&gt;&lt;/p&gt;&lt;p&gt;If the types aren't correct, you'll get a 404:&lt;pre class="prettyprint"&gt;&lt;br /&gt;curl "http://localhost:8080/hello?name=larry&amp;count=a" -D -&lt;br /&gt;HTTP/1.1 404 Not Found&lt;br /&gt;&lt;/pre&gt;&lt;/p&gt;&lt;h2&gt;path parameters&lt;/h2&gt;&lt;p&gt;Path parameters in jersey work pretty much the same way as query params, i.e. typed, with default values and appearing as method arguments.  Additionally you can do some simple validation using regular expressions.  
Their names and path location are specified when defining the route.&lt;/p&gt;&lt;p&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;@Path("/hello/{userid}")&lt;br /&gt;class Hello {&lt;br /&gt;  def doGet(@PathParam("userid") id: Int)&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;/p&gt;&lt;p&gt;Here's a different example showing some simple regex validation support:&lt;pre class="prettyprint"&gt;&lt;br /&gt;@Path("/hello/{username: [a-zA-Z][a-zA-Z_0-9]+}")&lt;br /&gt;&lt;/pre&gt;And, again, if the path doesn't match your regex, or type, you will get a 404.&lt;/p&gt;&lt;h2&gt;other params&lt;/h2&gt;&lt;p&gt;There are also &lt;code&gt;@CookieParam&lt;/code&gt;, &lt;code&gt;@HeaderParam&lt;/code&gt; annotations for reading cookie and header values, as well as support for pulling in session or request variables using &lt;code&gt;@Context&lt;/code&gt; or custom annotations (e.g. I've created &lt;code&gt;@IpAddress&lt;/code&gt; for pulling in the ip).&lt;/p&gt;&lt;h2&gt;overall thoughts&lt;/h2&gt;&lt;p&gt;I really like the automatic de-serialization and type conversion, and having the framework handle errors for incompatible parameters automatically.&lt;/p&gt;&lt;p&gt;I also like the POJO mindset.  It's just a function with arguments like any other.  All else being equal, this makes testing in any framework super easy.&lt;/p&gt;&lt;p&gt;The annotations do seem a little noisy, especially having to specify the parameter name.  I think we can do better.&lt;/p&gt;&lt;h1&gt;scalatra&lt;/h1&gt;&lt;p&gt;&lt;a href="http://www.scalatra.org/"&gt;scalatra&lt;/a&gt; is also very nice for simple apps, and I've quickly become a fan of &lt;a href="http://scalate.fusesource.org/"&gt;scalate&lt;/a&gt; which it uses for templating.&lt;/p&gt;&lt;p&gt;When it comes to dealing with parameters though, it feels like a step back.  Everything is strings.  
The fact that it's scala makes it easier to deal with, but it does feel like the framework could go a little further to help you out.&lt;/p&gt;&lt;h2&gt;query parameters&lt;/h2&gt;&lt;p&gt;To read query parameters, you use the &lt;code&gt;params&lt;/code&gt; method from &lt;a href="http://www.scalatra.org/2.0/api/org/scalatra/ScalatraServlet.html"&gt;ScalatraServlet&lt;/a&gt;.  &lt;code&gt;params&lt;/code&gt; is a &lt;a href="http://www.scalatra.org/2.0/api/org/scalatra/util/MultiMapHeadView.html"&gt;MultiMapHeadView&lt;/a&gt;[String, String].  So yes, you are back to dealing with Strings (or a Seq[String] if using &lt;code&gt;multiParams&lt;/code&gt;).&lt;/p&gt;&lt;p&gt;E.g.&lt;pre class="prettyprint"&gt;&lt;br /&gt;get("/hello") {&lt;br /&gt;  val name:String = params.getOrElse("name", halt(400))&lt;br /&gt;  val count:Int = params.getOrElse("count", "2").toInt&lt;br /&gt;&lt;br /&gt;  "name: %s, count: %d\n".format(name, count)&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;Calling this path without a name will generate a 400, as expected:&lt;pre class="prettyprint"&gt;&lt;br /&gt;$ curl "http://localhost:8080/hello" -D - &lt;br /&gt;HTTP/1.1 400 Bad Request&lt;br /&gt;&lt;/pre&gt;If you don't specify count, you will get the default of 2.  However, if you specify a non-int, you'll get a 200 where the contents are the stack trace for the &lt;code&gt;toInt&lt;/code&gt; call.  
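If you would rather answer with a 400 in that case, one option is to guard the conversion yourself. Here is a minimal sketch, assuming a hypothetical helper named &lt;code&gt;parseInt&lt;/code&gt; (it is not part of scalatra) and reusing the &lt;code&gt;params&lt;/code&gt; and &lt;code&gt;halt&lt;/code&gt; calls from the example above:&lt;pre class="prettyprint"&gt;&lt;br /&gt;// hypothetical helper, not part of scalatra: None when s is not a valid Int&lt;br /&gt;def parseInt(s: String): Option[Int] =&lt;br /&gt;  try { Some(s.toInt) } catch { case _: NumberFormatException =&gt; None }&lt;br /&gt;&lt;br /&gt;get("/hello") {&lt;br /&gt;  val name:String = params.getOrElse("name", halt(400))&lt;br /&gt;  // a missing count still defaults to 2; an unparseable count halts with a 400&lt;br /&gt;  val count:Int = parseInt(params.getOrElse("count", "2")).getOrElse(halt(400))&lt;br /&gt;&lt;br /&gt;  "name: %s, count: %d\n".format(name, count)&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;Without a guard like this, the original route responds as shown next. 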
Again, your validation is all manual -- if you want better type validation, it's up to you.&lt;pre class="prettyprint"&gt;&lt;br /&gt;2:~ larry$ curl "http://localhost:8080/hello?name=larry&amp;count=a" -D -&lt;br /&gt;HTTP/1.1 200 OK&lt;br /&gt;…&lt;br /&gt;&amp;lt;p&amp;gt;&lt;br /&gt;java.lang.NumberFormatException: For input string: &amp;quot;a&amp;quot;&lt;br /&gt;&amp;lt;/p&amp;gt;&lt;br /&gt;&lt;/pre&gt;&lt;/p&gt;&lt;h2&gt;path parameters&lt;/h2&gt;&lt;p&gt;Path params work exactly the same way (including being accessed in &lt;code&gt;params&lt;/code&gt;), and are named as part of your route:&lt;pre class="prettyprint"&gt;&lt;br /&gt;get("/hello/:name/:count") {&lt;br /&gt;  val name:String = params.getOrElse("name", halt(400))&lt;br /&gt;  val count:Int = params.getOrElse("count", "2").toInt&lt;br /&gt;&lt;br /&gt;  "name: %s, count: %d\n".format(name, count)&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;$ curl "http://localhost:8080/hello/larry/5"&lt;br /&gt;name: larry, count: 5&lt;br /&gt;&lt;/pre&gt;&lt;/p&gt;&lt;h2&gt;overall thoughts&lt;/h2&gt;&lt;p&gt;The manual de-serialization seems a little dated, and gets old quick.  
Scala does make it nicer than it would be in java, since you can do things like &lt;code&gt;params.getOrElse("name", halt(400))&lt;/code&gt;, but I would like to see more.&lt;/p&gt;&lt;p&gt;I also miss the POJO mindset… when testing you need to do whatever additional setup is necessary to serialize your params as strings and stick them in a map.&lt;/p&gt;&lt;p&gt;I guess I also don't like that barring convention, there's no formal definition of what parameters you are expecting and what their types are - maybe you are calling params.get somewhere in the middle of your method.&lt;/p&gt;&lt;h1&gt;play&lt;/h1&gt;&lt;p&gt;&lt;a href="http://scala.playframework.org/"&gt;play&lt;/a&gt; the framework feels a little heavy compared to jersey and scalatra, but it definitely shines when it comes to dealing with parameters. &lt;/p&gt;&lt;h2&gt;query parameters&lt;/h2&gt;&lt;p&gt;Query parameters in play are done really well.  It's almost perfect.&lt;/p&gt;&lt;p&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;def hello(name: String, count: Int = 2) = {&lt;br /&gt;  "name: %s, count: %d\n".format(name, count)&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;$ curl "http://localhost:9000/hello?name=larry"&lt;br /&gt;name: larry, count: 2&lt;br /&gt;&lt;/pre&gt;You can even use &lt;code&gt;Option&lt;/code&gt; for parameters that may not be present:&lt;pre&gt;&lt;br /&gt;def hello(name: Option[String], count: Int = 2) = {&lt;br /&gt;  "name: %s, count: %d\n".format(name.getOrElse("anon"), count)&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;/p&gt;&lt;p&gt;One problem is that type conversion errors are silently ignored, and defaults will be used:&lt;pre class="prettyprint"&gt;&lt;br /&gt;$ curl "http://localhost:9000/hello?name=larry&amp;count=a"&lt;br /&gt;name: larry, count: 2&lt;br /&gt;&lt;/pre&gt;Okay, so they're not really ignored.  If you call &lt;code&gt;Validation.hasErrors&lt;/code&gt;, it will return true, and you can discover the error.  
This is the same mechanism you need to use to mark a parameter as required:&lt;pre class="prettyprint"&gt;&lt;br /&gt;def hello(name: String, count: Int = 2) = {&lt;br /&gt;  Validation.required("name", name)&lt;br /&gt;  if (Validation.hasErrors) {&lt;br /&gt;    // handle error&lt;br /&gt;&lt;/pre&gt;&lt;/p&gt;&lt;h2&gt;path parameters&lt;/h2&gt;&lt;p&gt;Path parameters work the same way.  They're defined with placeholders in your route, and automatically passed in as the correct argument.  In play, routes are defined external to your code, in a &lt;code&gt;routes&lt;/code&gt; file.  E.g.&lt;pre class="prettyprint"&gt;&lt;br /&gt;GET /hello/{name} Application.hello&lt;br /&gt;&lt;/pre&gt;Our method looks the same as the first query param example.  Calling it looks like this:&lt;pre class="prettyprint"&gt; &lt;br /&gt;$ curl "http://localhost:9000/hello/larry"&lt;br /&gt;name: larry, count: 2&lt;br /&gt;&lt;/pre&gt;In the case of path parameters, we will get a 404 if the parameter is missing (since it won't match our route).&lt;pre class="prettyprint"&gt;&lt;br /&gt;$ curl "http://localhost:9000/hello/" -D -&lt;br /&gt;&lt;br /&gt;HTTP/1.1 404 Not Found&lt;br /&gt;&lt;/pre&gt;&lt;/p&gt;&lt;h2&gt;overall thoughts&lt;/h2&gt;&lt;p&gt;Overall I think parameters in play are done really well.&lt;/p&gt;&lt;p&gt;Like jersey, I really appreciate the POJO approach.  play does it even better by eliminating the extra annotations and leveraging scala's default argument support.&lt;/p&gt;&lt;p&gt;Validation does seem a little clunky, though.  
It seems like more could be done there.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-1706739937190703808?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/1706739937190703808/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=1706739937190703808' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/1706739937190703808'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/1706739937190703808'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2011/11/5-minute-web-framework-review-reading.html' title='5 minute web framework review : reading params'/><author><name>larry ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-7215609983234557581</id><published>2011-10-13T07:49:00.000-07:00</published><updated>2011-10-13T07:58:42.592-07:00</updated><title type='text'>Advanced Amazon Web Services Meetup</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/-FJgdUkyOpLU/Tpb8kjMDZgI/AAAAAAAABSs/K-9MsRG7uyE/s1600/logo_aws.gif"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 164px; height: 60px;" src="http://2.bp.blogspot.com/-FJgdUkyOpLU/Tpb8kjMDZgI/AAAAAAAABSs/K-9MsRG7uyE/s400/logo_aws.gif" border="0" 
alt="" id="BLOGGER_PHOTO_ID_5662991286316262914" /&gt;&lt;/a&gt;&lt;br /&gt;Bizo Engineering, with the help of Amazon Web Services, has developed a &lt;a href="http://www.meetup.com/AdvancedAWS/"&gt;new meetup&lt;/a&gt; in San Francisco that focuses on &lt;a href="http://www.meetup.com/AdvancedAWS/"&gt;Advanced Amazon Web Services&lt;/a&gt; topics.  The main concept is that there are many companies that have been operating on AWS for a number of years and have significant experience but do not have a forum to discuss more advanced issues and architectures.  &lt;br /&gt;&lt;br /&gt;The next meetup is on Tuesday October 18th at 7pm in San Francisco (&lt;a href="http://www.meetup.com/AdvancedAWS/events/33538982/"&gt;more details&lt;/a&gt;).  We have speakers from &lt;a href="http://bizo.com"&gt;Bizo&lt;/a&gt;, &lt;a href="http://www.twilio.com/"&gt;Twilio&lt;/a&gt; and &lt;a href="http://aws.amazon.com"&gt;Amazon Web Services&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;We look forward to meeting other AWS companies and sharing stories from the trenches.  
Hope to see you there!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-7215609983234557581?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/7215609983234557581/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=7215609983234557581' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7215609983234557581'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7215609983234557581'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2011/10/advanced-amazon-web-services-meetup.html' title='Advanced Amazon Web Services Meetup'/><author><name>Donnie</name><uri>http://www.blogger.com/profile/13599133732419522440</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-FJgdUkyOpLU/Tpb8kjMDZgI/AAAAAAAABSs/K-9MsRG7uyE/s72-c/logo_aws.gif' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-8054473549048285011</id><published>2011-10-10T13:29:00.001-07:00</published><updated>2011-10-10T13:31:52.584-07:00</updated><title type='text'>the story of HOD : ahead of its time, obsolete at launch</title><content type='html'>&lt;p&gt;Last week we shut down an early part of the bizo infrastructure : HOD (Hadoop on Demand).  I thought it might be fun to look back on this project a bit.&lt;/p&gt;&lt;p&gt;We've been using AWS as long as bizo has been around, since early 2008.  
Hadoop has always been a big part of that.  When we first started, we were mostly using a shared hadoop cluster.  This was kind of a pain for job scheduling, but also was mostly wasteful during off-peak hours…  Thus, HOD was born.&lt;/p&gt;&lt;p&gt;From its documentation, "The goal of HOD is to provide an on demand, scalable, sandboxed infrastructure to run Hadoop jobs."  &lt;a href="http://aws.amazon.com/elasticmapreduce/"&gt;Sound familiar&lt;/a&gt;?  HOD was developed late September and October of 2008, and launched for internal use December 12th, 2008.  Amazon announced &lt;a href="http://aws.amazon.com/elasticmapreduce/"&gt;EMR&lt;/a&gt; April of 2009.  It's amazing how similar they ended up being… especially since we had no knowledge of EMR at the time.&lt;/p&gt;&lt;p&gt;Even though HOD had a few nice features missing from EMR, the writing was on the wall.  For new tasks, we wrote them for EMR.  We slowly migrated old reports to EMR when they needed changes, or we had the time.&lt;/p&gt;&lt;h1&gt;Architecture&lt;/h1&gt;&lt;p&gt;HOD borrowed quite liberally from the design of &lt;a href="http://aws.amazon.com/articles/1632?_encoding=UTF8&amp;jiveRedirect=1"&gt;Alexa's GrepTheWeb&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;&lt;a href="http://com-bizo-public.s3.amazonaws.com/blog/hod/hod_overview.png"&gt;&lt;img src="http://com-bizo-public.s3.amazonaws.com/blog/hod/hod_overview.png" /&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;Users submitted job requests to a controller which managed starting exclusive hadoop clusters (master, slaves, hdfs), retrieving job input from S3 to HDFS, executing the job (hadoop map/reduce), monitoring the job, storing job results, and shutting down the cluster on completion.  
Job information and status were stored in SimpleDB, S3 was used for job inputs and outputs, and SQS was used to manage the job workflow.&lt;/p&gt;&lt;h2&gt;Job Definition&lt;/h2&gt;Jobs were defined as thrift structures:&lt;pre class="prettyprint"&gt;&lt;br /&gt; struct JobRequest {&lt;br /&gt;   1: JobConf job = {},&lt;br /&gt;   2: i32 requested_nodes = 4, // requested number of hadoop slaves&lt;br /&gt;   3: string node_type = "m1.large", // machine size&lt;br /&gt;   4: i32 requested_priority, // job priority&lt;br /&gt;   5: string hadoop_dist, // hadoop version e.g. "0.18.3"&lt;br /&gt;   6: set&amp;lt;string&amp;gt; depends_on = [], // ids of job dependencies&lt;br /&gt;   &lt;br /&gt;   7: list&amp;lt;Notification&amp;gt; on_success = [], // success notification&lt;br /&gt;   8: list&amp;lt;Notification&amp;gt; on_failure = [], // failure notification&lt;br /&gt;   9: set&amp;lt;string&amp;gt; flags = [] // everyone loves flags&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt; struct JobConf {&lt;br /&gt;   1: string job_name,&lt;br /&gt;   2: string job_jar,  // s3 jar path&lt;br /&gt;   3: string mapper_class,&lt;br /&gt;   4: string combiner_class,&lt;br /&gt;   5: string reducer_class,&lt;br /&gt;&lt;br /&gt;   6: set&amp;lt;string&amp;gt; job_input,  // s3 input paths&lt;br /&gt;   7: string job_output,  // s3 output path (optional)&lt;br /&gt;   &lt;br /&gt;   8: string input_format_class,&lt;br /&gt;   9: string output_format_class,&lt;br /&gt;   &lt;br /&gt;   10: string input_key_class,&lt;br /&gt;   11: string input_value_class,&lt;br /&gt;   &lt;br /&gt;   12: string output_key_class,&lt;br /&gt;   13: string output_value_class,&lt;br /&gt;   &lt;br /&gt;   // list of files for hadoop distributed cache&lt;br /&gt;   14: set&amp;lt;string&amp;gt; user_data = [],&lt;br /&gt;   &lt;br /&gt;   15: map&amp;lt;string,string&amp;gt; other_config = {}, // passed directly to JobConf.set(k, v)&lt;br /&gt; }&lt;br /&gt;&lt;/pre&gt;You'll notice that dependencies could be specified.  
HOD could hook up the output of one or more jobs into the input of a job and wouldn't run the job until all of its dependencies had successfully completed.&lt;h1&gt;User Interaction&lt;/h1&gt;We had a user program, similar to emr-client, that helped construct and submit jobs, e.g.:&lt;pre class="prettyprint"&gt;&lt;br /&gt;JOB="\&lt;br /&gt; -m com.bizo.blah.SplitUDCMap \&lt;br /&gt; -r com.bizo.blah.UDCReduce \&lt;br /&gt; -jar com-bizo-release:blah/blah/blah.jar \&lt;br /&gt; -jobName blah \&lt;br /&gt; -i com-bizo-data:blah/blah/blah/${MONTH} \&lt;br /&gt; -outputKeyClass org.apache.hadoop.io.Text \&lt;br /&gt; -outputValueClass org.apache.hadoop.io.Text \&lt;br /&gt; -nodes 10 \&lt;br /&gt; -nodeType c1.medium \&lt;br /&gt; -dist 0.18.3 \&lt;br /&gt; -emailSuccess larry@bizo.com \&lt;br /&gt; -emailFailure larry@bizo.com \&lt;br /&gt;"&lt;br /&gt;&lt;br /&gt;$HOD_HOME/bin/hod_submit $JOB $@&lt;br /&gt;&lt;/pre&gt;There was also some nice support for looking at jobs, either by status or by name:&lt;img src="http://com-bizo-public.s3.amazonaws.com/blog/hod/hod_status.png" /&gt;&lt;img src="http://com-bizo-public.s3.amazonaws.com/blog/hod/hod_name.png" /&gt;As well as support for viewing job output, logs, counters, etc.&lt;h1&gt;Nice features&lt;/h1&gt;&lt;p&gt;We've been very happy users of Amazon's EMR since it launched in 2009.  There's nothing better than systems you don't need to support/maintain yourself!  And they've been really busy making EMR easier to use and adding great features.  Still, there are a few things I miss from HOD.&lt;/p&gt;&lt;h2&gt;Workflow support&lt;/h2&gt;&lt;p&gt;As mentioned, HOD had support for constructing job workflows.  You could wire up dependencies among multiple jobs.  E.g. 
here's an example workflow (also mentioned in this blog previously under &lt;a href="http://dev.bizo.com/2009/01/hadoop-job-visualization.html"&gt;hadoop-job-visualization&lt;/a&gt;).&lt;/p&gt;&lt;p&gt;&lt;img src="http://com-bizo-public.s3.amazonaws.com/hod/acct/hod_acct_01_28_2009.png" /&gt;&lt;/p&gt;&lt;p&gt;It would be nice to see something like this in EMR.  For really simple workflows, you can sometimes squeeze them into a single EMR job as multiple steps, but that doesn't always make sense and isn't always convenient.&lt;/p&gt;&lt;h2&gt;Notification support&lt;/h2&gt;&lt;p&gt;HOD supported notifications directly.  Initially just email notifications, but there was a plugin structure in place with an eye towards supporting HTTP endpoint and SQS notifications.&lt;/p&gt;&lt;p&gt;Yes, this is possible by adding a custom EMR job step at the end that checks the status of itself and sends an email/failure notification…  But, c'mon, why not just build in easy &lt;a href="http://aws.amazon.com/sns/"&gt;SNS support&lt;/a&gt;?  Please?&lt;/p&gt;&lt;h2&gt;Counter support&lt;/h2&gt;&lt;p&gt;Building on that, HOD had direct support/understanding for hadoop counters.  When processing large volumes of data, they become really critical in tracking the health of your reports over time.  This is something I really miss.  Although, it's less obvious how to fold this in with Hive jobs, which is how most of our reports are written these days.&lt;/p&gt;&lt;h2&gt;Arbitrary Hadoop versions&lt;/h2&gt;&lt;p&gt;HOD operated with straight hadoop, so it was possible to have it install an arbitrary version/package just by pointing it to the right distribution in S3.&lt;/p&gt;&lt;p&gt;Since Amazon isn't directly using a distribution from the hadoop/hive teams, you need to wait for them to apply their patches/changes and can only run with versions they directly support.  
This has mostly been a problem with Hive, which moves pretty quickly.&lt;/p&gt;&lt;p&gt;It would be really great if they could get to a point where their changes have been folded back into the main distribution.&lt;/p&gt;&lt;p&gt;Of course, this is probably something you can do yourself, again with a custom job step to install your own version of Hive…  Still, they have some nice improvements, and again, it would be nice if it were just a simple option to the job.&lt;/p&gt;&lt;h1&gt;The Past / The Future&lt;/h1&gt;&lt;p&gt;Of course, HOD wasn't without its problems :).  It's become a bear to manage, especially since we pretty much stopped development / maintenance (aside from rebooting it) back in 2009.  It was definitely with a sigh of relief that I pulled the plug.&lt;/p&gt;&lt;p&gt;Still, HOD was a really fun project!  It was an early project for me at Bizo, and it was really amazing how easy it was to write a program that starts up machines! and gets other programs installed and running!  Part of me wonders if there isn't a place for an open source EMR-like infrastructure somewhere?  Maybe for private clouds?  Maybe for people who want/need more control? Or for cheapskates? 
:) (EMR does &lt;a href="http://aws.amazon.com/elasticmapreduce/#pricing"&gt;cost money&lt;/a&gt;).&lt;/p&gt;&lt;p&gt;Or maybe HOD v2 is just some wrappers around EMR that provides some of the things I miss : workflow support, notifications, easier job configuration…&lt;/p&gt;&lt;p&gt;Something to think about for that next hack day :).&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-8054473549048285011?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/8054473549048285011/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=8054473549048285011' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/8054473549048285011'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/8054473549048285011'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2011/10/story-of-hod-ahead-of-its-time-obsolete.html' title='the story of HOD : ahead of its time, obsolete at launch'/><author><name>larry ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-5711692311336984981</id><published>2011-10-05T17:07:00.000-07:00</published><updated>2011-10-05T17:17:56.633-07:00</updated><title type='text'>Writing  better 3rd party Javascript with Coffeescript, Jasmine, PhantomJS  and Dependence.js</title><content type='html'>&lt;p&gt;Here at Bizo we recently underwent a major 
change to our Javascript tags that are used in our free analytics product (&lt;a href="http://www.bizo.com/marketer/audience_analytics"&gt;http://www.bizo.com/marketer/audience_analytics&lt;/a&gt;). Our code gets loaded by millions of visitors each month across thousands of web sites, so our Javascript has to run reliably in just about any browser on any page.&lt;/p&gt;&lt;h3&gt;The Old Javascript:&lt;/h3&gt;&lt;p&gt;Unfortunately our codebase had accumulated about three years of cruft,&lt;br /&gt;resulting in a single monolithic Javascript file.  The file contained&lt;br /&gt;a single closure, over 600 lines long, with all kinds of edge cases,&lt;br /&gt;some of which no longer existed. Since you can’t access anything in a&lt;br /&gt;closure from the outside, writing unit tests was nearly impossible and&lt;br /&gt;the code had suffered as a result.&lt;/p&gt;&lt;p&gt;That’s not to say we had no tests – there were several selenium tests&lt;br /&gt;that tested functionality at a high level. The problem, however, was that&lt;br /&gt;making the required changes was going to be a time-consuming (and&lt;br /&gt;somewhat terrifying) process. The selenium tests provided a very slow&lt;br /&gt;testing feedback loop and debugging a large closure comes with its&lt;br /&gt;own challenges. If it’s scary changing your production code, then&lt;br /&gt;you’re doing something wrong.&lt;/p&gt;&lt;h3&gt;Modularity and Dependency Management&lt;/h3&gt;&lt;p&gt;&lt;br /&gt;So we decided to do a complete overhaul and rewrite our Javascript&lt;br /&gt;tags in CoffeeScript  (smaller code base with clearer code). The&lt;br /&gt;biggest problem the original code had was that it wasn’t modular and&lt;br /&gt;thus difficult to test. Ideally we wanted to split our project into&lt;br /&gt;multiple files that we could unit test.  To do this we needed some&lt;br /&gt;kind of dependency management system for Javascript, which in 2011&lt;br /&gt;surprisingly isn’t standardized yet. 
We looked at several projects but&lt;br /&gt;none of them really met our needs. Our users are quite sensitive about&lt;br /&gt;the number of http requests 3rd party Javascript makes so solutions&lt;br /&gt;that load several scripts in parallel weren’t an option (ex.&lt;br /&gt;Requirejs).  Others like Sprockets were close but didn’t quite support&lt;br /&gt;everything we needed.&lt;/p&gt;&lt;p&gt;We ended up writing Dependence.js, a gem to manage our Javascript&lt;br /&gt;dependencies. Dependence will compile all your files in a module into&lt;br /&gt;a single file. Features include javascript and/or Coffeescript&lt;br /&gt;compilation, dependency resolution (via a topological sort of your&lt;br /&gt;dependency graph), allowing you to use an “exports” object for your&lt;br /&gt;module’s interface, and optional compression using Google’s Closure&lt;br /&gt;compiler. Check it out on github:&lt;br /&gt;(&lt;a href="http://github.com/jcarver989/dependence.js"&gt;http://github.com/jcarver989/dependence.js&lt;/a&gt;)&lt;/p&gt;&lt;h3&gt;Fast Unit testing with Phantom.js&lt;/h3&gt;&lt;p&gt;Another way we were looking to improve our Javascript setup was to&lt;br /&gt;have a comprehensive suite of unit tests. After looking at several&lt;br /&gt;possibilities we settled on using the Jasmine test framework&lt;br /&gt;(&lt;a href="http://pivotal.github.com/jasmine/"&gt;http://pivotal.github.com/jasmine/&lt;/a&gt;) in conjunction with PhantomJS (a&lt;br /&gt;headless webkit browser).  So far using Jasmine and PhantomJS together&lt;br /&gt;has been awesome. As our Javascript is inherently DOM coupled, each of&lt;br /&gt;our unit tests executes in a separate iframe (so each test has its own&lt;br /&gt;separate document object). 
126 unit tests later the entire suite runs&lt;br /&gt;locally in about 0.1 seconds!&lt;/p&gt;&lt;h3&gt;Functional Testing with Selenium&lt;/h3&gt;&lt;p&gt;Our functional tests are still executed with Selenium webdriver.&lt;br /&gt;Although there are alternative options such as HtmlUnit, we wanted to&lt;br /&gt;test our code in real browsers and for this Selenium is still the best&lt;br /&gt;option around. A combination of capybara and rspec make for writing&lt;br /&gt;functional tests with a nicer api than raw selenium. A bonus is&lt;br /&gt;that capybara allows you to swap out selenium in favor of another&lt;br /&gt;driver should we ever want to switch to something else. Lastly a&lt;br /&gt;custom gem for creating static html fixtures allows us to&lt;br /&gt;programmatically generate test pages for each possible configuration&lt;br /&gt;option found in our Javascript module. You can find that here:&lt;br /&gt;(&lt;a href="http://github.com/jcarver989/js-fixtures"&gt;http://github.com/jcarver989/js-fixtures&lt;/a&gt;).&lt;/p&gt;&lt;h3&gt;Wrapping up&lt;/h3&gt;&lt;p&gt;The new code is far more modular, comprehensively tested and way&lt;br /&gt;easier to extend. Overall working with Dependence.js, CoffeeScript,&lt;br /&gt;PhantomJS, Capybara, Rspec and Selenium has been a workflow that works&lt;br /&gt;great for us. 
If you have a different workflow that you like for&lt;br /&gt;Javascript projects, let us know!&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-5711692311336984981?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/5711692311336984981/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=5711692311336984981' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/5711692311336984981'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/5711692311336984981'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2011/10/writing-better-3rd-party-javascript.html' title='Writing  better 3rd party Javascript with Coffeescript, Jasmine, PhantomJS  and Dependence.js'/><author><name>Josh Carver</name><uri>http://www.blogger.com/profile/15167764329841650102</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-5742099798421000706</id><published>2011-08-24T13:58:00.001-07:00</published><updated>2011-08-24T14:27:18.911-07:00</updated><title type='text'>Cloudwatch metrics revisited</title><content type='html'>In a &lt;a href="http://dev.bizo.com/2011/05/cloudwatch-custom-metrics-bizo.html"&gt;previous post&lt;/a&gt;, I discussed our initial usage of cloudwatch custom metrics.  
Since then, we've added more metrics and changed how we're recording them, so I thought it might be helpful to revisit the topic.&lt;h2&gt;Metric Namespaces&lt;/h2&gt;Initially we had a single namespace per application.  We decided that the stage should be included in the namespace.  E.g. api-web-prod, api-web-dev.  It seems to make sense to keep metrics from different stages completely separate, especially if you are using them for alerting or scaling events.&lt;h2&gt;Metric Regions&lt;/h2&gt;When we started, we were logging all metrics to us-east (may have been a requirement of the beta?).  Going forward, it made sense to log to the specific region where the events occurred.  There's a little more work if you want to aggregate across regions, but it matches the rest of our infrastructure layout better.  Also, if you want to use metrics for auto-scaling events, it's a requirement.&lt;h2&gt;Dropping InstanceId dimensions (by default)&lt;/h2&gt;&lt;p&gt;This is something we are currently working on rolling out.  When we first started logging events, we would include a metric update tagged with the InstanceId.  This mirrors how the built-in AWS metrics work.  It seemed like it would be useful to be able to "drill down" when investigating an issue, i.e. the Maximum CPU utilization in this group is at 100%, okay, which instance is it?&lt;/p&gt;&lt;p&gt;In practice, we have started to question the utility versus cost, especially for custom metrics.  When you run large services with auto-scaling, you end up generating a lot of metrics for very transient instances.  Since the cost structure is based on the number of unique metrics used, this can really add up.&lt;/p&gt;&lt;p&gt;For some numbers, looking at the output of mon-list-metrics in us-east-1 only, we have 31,888 metrics with an InstanceId dimension!  That's just for the last 2 weeks.  
If we were paying for all of those (luckily most of them are for built-in metrics), it would cost us $15k for those 2 weeks of metrics on very transient instances.&lt;/p&gt;&lt;p&gt;It has been useful to have InstanceId granularity metrics in the past, and in a perfect world maybe we'd still be collecting them, but with the current price structure it's just too expensive for most of our (auto-scaled) services.&lt;/p&gt;&lt;h2&gt;Metric Dimensions revisited&lt;/h2&gt;When we first started using cloudwatch custom metrics, we would log the following dimensions for each event:&lt;ul&gt;&lt;li&gt;Version, e.g. 124021 (svn revision number)&lt;/li&gt;&lt;li&gt;Stage, e.g. prod&lt;/li&gt;&lt;li&gt;Region, e.g. us-west-1&lt;/li&gt;&lt;li&gt;Application, e.g. api-web&lt;/li&gt;&lt;li&gt;InstanceId, e.g. i-201345a&lt;/li&gt;&lt;/ul&gt;We can drop Stage and Region due to our namespace and region changes above.  As mentioned, we've also decided to drop InstanceId for most of our services.  This makes our current list of dimension defaults:&lt;ul&gt;&lt;li&gt;Version, e.g. 124021 (svn revision number)&lt;/li&gt;&lt;li&gt;Application, e.g. api-web&lt;/li&gt;&lt;/ul&gt;We're still tracking stage and region, based on the namespace or region, they just don't need to be expressed as dimensions.&lt;h2&gt;More Metrics!&lt;/h2&gt;One of our developers, &lt;a href="https://github.com/balshor"&gt;Darren&lt;/a&gt;, put together a JMX-&gt;Cloudwatch bridge.  Each application can express which JMX stats it would like to export via a JSON config file.  
Here's a short excerpt that will send HeapMemoryUsage to cloudwatch every 60 seconds:&lt;pre class="prettyprint"&gt;&lt;br /&gt;  {&lt;br /&gt;    "objectName" : "java.lang:type=Memory",&lt;br /&gt;    "attribute" : "HeapMemoryUsage",&lt;br /&gt;    "compositeDataKey" : "used",&lt;br /&gt;    "metricName" : "HeapMemoryUsage",&lt;br /&gt;    "unit" : "Bytes",&lt;br /&gt;    "frequency" : 60&lt;br /&gt;  },&lt;br /&gt;&lt;/pre&gt;I'm sure the list will grow, but some of the metrics we've found most useful so far:&lt;ul&gt;  &lt;li&gt;NonHeapMemoryUsage&lt;/li&gt;  &lt;li&gt;HeapMemoryUsage&lt;/li&gt;  &lt;li&gt;OpenFileDescriptorCount&lt;/li&gt;  &lt;li&gt;SystemLoadAverage&lt;/li&gt;  &lt;li&gt;ThreadCount&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;I'm hoping Darren will describe the bridge in more detail in a future post.  It's made it really easy for applications to push system metrics to cloudwatch.&lt;/p&gt;&lt;p&gt;Of course, we're also sending a lot of application specific event metrics.&lt;/p&gt;&lt;h2&gt;Homegrown interfaces&lt;/h2&gt;&lt;p&gt;The AWS cloudwatch console is &lt;em&gt;really&lt;/em&gt; slow. It also seems to load only 5,000 metrics.  Our us-east "AWS/EC2" namespace alone has 28k metrics.  Additionally, you can only view metrics for a single region at a time.  We just haven't had a lot of success with the web console.&lt;/p&gt;&lt;p&gt;We've been relying pretty heavily on the command line tools for investigation, which can be a little tedious.&lt;/p&gt;&lt;p&gt;We've also written some scripts that will aggregate daily metrics for each app and insert them into a google docs spreadsheet to help track trends.&lt;/p&gt;&lt;p&gt;For our last hack day, I started working on a (very rough!) 
prototype for a custom cloudwatch console.&lt;a href="http://com-bizo-public.s3.amazonaws.com/blog/hackday/cw-console/custom-dash.png"&gt;&lt;img src="http://com-bizo-public.s3.amazonaws.com/blog/hackday/cw-console/custom-dash-sm.png"/&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;The app is written using &lt;a href="http://scala.playframework.org/"&gt;play (scala)&lt;/a&gt; with &lt;a href="http://code.google.com/p/flot/"&gt;flot&lt;/a&gt; for the graphs.&lt;/p&gt;&lt;p&gt;It heavily caches the namespace/metric/dimension/value hierarchies, and queries all regions simultaneously.  It certainly feels much faster than the built-in console.&lt;/p&gt;&lt;p&gt;It's great just being able to quickly graph metrics by name, but my main motivation for this console was to provide a place where we could start to inject some intelligence about our metrics.  The cloudwatch interface has to be really generic to support a wide range of uses/metrics.  For our metrics, we have a better understanding of what they mean and how they're related.  E.g. If the ErrorCount metric is high, we know which other metrics/dimensions can help us drill down and find the cause.  I'm hoping to build those kinds of relationships into this dashboard.&lt;/p&gt;&lt;h2&gt;Summary&lt;/h2&gt;&lt;p&gt;So that's how we're currently using cloudwatch at bizo.  There are still some rough edges, but we've been pretty happy with it.  
It's really easy to log and aggregate metric data with hardly any infrastructure.&lt;/p&gt;&lt;p&gt;I'd love to hear any other experiences, comments, uses people have had with cloudwatch.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-5742099798421000706?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/5742099798421000706/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=5742099798421000706' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/5742099798421000706'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/5742099798421000706'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2011/08/cloudwatch-metrics-revisited.html' title='Cloudwatch metrics revisited'/><author><name>larry ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-1605456710277851994</id><published>2011-08-19T14:22:00.000-07:00</published><updated>2011-10-25T15:57:49.822-07:00</updated><title type='text'>Report delivery from Hive via Google Spreadsheets</title><content type='html'>&lt;p&gt;At Bizo, we run a number of periodically scheduled Hive jobs that produce a high-level summary as just a few (often, just one) rows of data.  
In the past, we’ve simply used the same delivery mechanism as with larger reports; the output is emailed as a CSV file to the appropriate distribution list.  This was less than ideal for a number of reasons:&lt;/p&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Managing the distribution lists is difficult.  We either needed to create a new list for each type of report, giving us a lot of lists to manage, or just send reports to a generic distribution list, resulting in a lot of unnecessary emails to people who weren’t necessarily interested in the report.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Handling the historical context is manual; the report needs to pull in past results to include in the output or recipients of the output need to find older emails to see trends appear.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Report delivery required an additional step in the job workflow outside of the Hive script.&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;&lt;p&gt;With the GData storage handler, we now just create a Google Spreadsheet, add appropriate column headers, and do something like this in our script:&lt;/p&gt;&lt;br /&gt;&lt;code class="prettyprint"&gt;&lt;br /&gt;add jar gdata-storagehandler.jar ;&lt;br /&gt;&lt;br /&gt;create external table gdata_output(&lt;br /&gt;  day string, cnt int, source_class string, source_method string, thrown_class string&lt;br /&gt;)&lt;br /&gt;stored by 'com.bizo.hive.gdata.GDataStorageHandler'&lt;br /&gt;with serdeproperties (&lt;br /&gt;  "gdata.user" = "user@bizo.com",&lt;br /&gt;  "gdata.consumer.key" = "bizo.com",&lt;br /&gt;  "gdata.consumer.secret" = "...",&lt;br /&gt;  "gdata.spreadsheet.name" = "Daily Exception Summary",&lt;br /&gt;  "gdata.worksheet.name" = "My Application",&lt;br /&gt;  "gdata.columns.mapping" = "day,count,class,method,thrown"&lt;br /&gt;)&lt;br /&gt;;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;p&gt;This appends whatever data is written to the table to the specified spreadsheet.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;The source code is available &lt;a 
href="https://github.com/balshor/gdata-storagehandler"&gt;here&lt;/a&gt;.  If you’re running your jobs on Amazon’s Elastic MapReduce, you can access the storage handler by adding the following line to your Hive script:&lt;/p&gt;&lt;br /&gt;&lt;code class="prettyprint"&gt;&lt;br /&gt;add jar s3://com-bizo-public/hive/storagehandler/gdata-storagehandler-0.1.jar ;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;p&gt;Note that the library only supports 2-legged OAuth access to Google Apps for Domains, which needs to be enabled in your Google Apps control panel.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-1605456710277851994?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/1605456710277851994/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=1605456710277851994' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/1605456710277851994'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/1605456710277851994'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2011/08/at-bizo-we-run-number-of-periodically.html' title='Report delivery from Hive via Google Spreadsheets'/><author><name>Darren</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-3872853802017753255</id><published>2011-08-12T16:15:00.001-07:00</published><updated>2011-08-12T16:17:34.084-07:00</updated><title type='text'>Bizo dev team @ 
TechShopSF</title><content type='html'>&lt;a href="http://www.flickr.com/photos/logrodnek/6033687422/" title="dev team screenprinting"&gt;&lt;img src="http://farm7.static.flickr.com/6121/6033687422_3592131df0.jpg" width="500" height="374" alt="IMG_0887"&gt;&lt;/a&gt;&lt;p&gt;Every quarter we have an "all hands" week, where the entire company comes to SF (the Bizo team is spread out across the country).&lt;/p&gt;&lt;p&gt;As part of this, we typically spend a day as a development team going over previous accomplishments and upcoming projects, as well as discussing our development process, architecture, etc.&lt;/p&gt;&lt;p&gt;We also spend some time making cool stuff!  Last time around we had an internal &lt;a href="http://www.arduino.cc/"&gt;Arduino&lt;/a&gt; workshop.  Each developer got an Arduino and various components, and we went through a bunch of exercises from &lt;a href="http://oreilly.com/catalog/9780596155520"&gt;Getting Started with Arduino&lt;/a&gt;.  We ended the day getting Wii controllers hooked up to our Arduinos (can't beat that).&lt;/p&gt;&lt;p&gt;This time around, we decided to head over to the SF &lt;a href="http://techshop.ws/"&gt;Techshop&lt;/a&gt; and learn how to screen print.&lt;/p&gt;&lt;p&gt;We ended up with some great shirts:&lt;/p&gt;&lt;a href="http://www.flickr.com/photos/logrodnek/6033688472/" title="bizo shirts"&gt;&lt;img src="http://farm7.static.flickr.com/6138/6033688472_161515f448.jpg" width="500" height="374" alt="IMG_0898"&gt;&lt;/a&gt;&lt;p&gt;They use a really cool process there, where you use a vinyl cutter to create a stencil for your artwork, which you can then just apply to your screen.&lt;/p&gt;&lt;p&gt;It was a lot of fun, and I think we all learned a lot.  
Special thanks to our instructor, Liz, as well as Devon at TechShop for helping us get this set up.&lt;/p&gt;&lt;p&gt;Check out some more shirts in this &lt;a href="http://www.flickr.com/photos/logrodnek/sets/72157627411676456/with/6033129479/"&gt;photo set&lt;/a&gt;.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-3872853802017753255?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/3872853802017753255/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=3872853802017753255' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/3872853802017753255'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/3872853802017753255'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2011/08/bizo-dev-team-techshopsf.html' title='Bizo dev team @ TechShopSF'/><author><name>larry ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm7.static.flickr.com/6121/6033687422_3592131df0_t.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-9142412288964028853</id><published>2011-06-01T15:02:00.001-07:00</published><updated>2011-06-01T15:03:24.970-07:00</updated><title type='text'>MongoSF follow-up and contest winners</title><content type='html'>Thanks for stopping by and meeting the Bizo engineering team at MongoSF.  
We had a great time meeting everyone at the conference and the after-party.  Stay in touch! As part of our conference sponsorship, we were able to include a small card in the conference bags.  &lt;a href="http://bizoneers.com"&gt;We are hiring&lt;/a&gt;, so we decided to use the space to talk a bit about the engineering team.  On the back of the card, we included a small puzzle (click through for the full version): &lt;a href="http://com-bizo-public.s3.amazonaws.com/blog/mongosf/2011/Bizo-Engineering-Side-B.pdf"&gt;&lt;img src="http://com-bizo-public.s3.amazonaws.com/blog/mongosf/2011/mipreview.png" /&gt;&lt;/a&gt; If you successfully completed the puzzle, you &lt;a href="http://www.bizoneers.com/mongosf2011/"&gt;landed here&lt;/a&gt;, with a chance to enter some pirate jokes and win an Amazon gift card. Congratulations to the winners:&lt;ul&gt;  &lt;li&gt;Y. Wayne H.&lt;/li&gt;  &lt;li&gt;Huy H.&lt;/li&gt;   &lt;li&gt;Dan N.&lt;/li&gt;&lt;/ul&gt;Like I said, we are hiring!  We're looking for smart, motivated people who get stuff done.  To learn more about the team, check out our &lt;a href="http://www.bizoneers.com/"&gt;mini engineering team site&lt;/a&gt;.  
If you would like to talk more, please &lt;a href="mailto:dev-jobs@bizo.com"&gt;get in touch&lt;/a&gt;! Finally, here are some of our favorite pirate jokes from the contest:&lt;dl&gt;  &lt;dt&gt;&lt;b&gt;How does a pirate do calculus?&lt;/b&gt;&lt;/dt&gt;  &lt;dd&gt;By taking a derivative with respect to  Arrrrrrrrrrrrrrrrrrrr!!!&lt;/dd&gt;  &lt;dt&gt;&lt;b&gt;Where is the hidden treasure map of Silicon  Valley?&lt;/b&gt;&lt;/dt&gt;  &lt;dd&gt;Legend has it that Captain Zukarrrrburg hides it in his Subvarrrrsion Reparrrrsitory ol' matey!&lt;/dd&gt;&lt;/dl&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-9142412288964028853?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/9142412288964028853/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=9142412288964028853' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/9142412288964028853'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/9142412288964028853'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2011/06/mongosf-follow-and-contest-winners.html' title='MongoSF follow-up and contest winners'/><author><name>larry ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-7079889799673640888</id><published>2011-05-22T16:48:00.001-07:00</published><updated>2011-05-22T16:49:02.908-07:00</updated><title type='text'>Bizo @ MongoSF</title><content type='html'>&lt;img src="http://lh6.ggpht.com/_N7Nsdm8uf9k/Tdmgw47xfaI/AAAAAAAAACc/pgGXoPtnhlw/mongoSF_badge_210x140.png?imgmax=800" alt="MongoSF badge 210x140" title="mongoSF_badge_210x140.png" border="0" width="210" height="140" /&gt;This Tuesday, May 24th the &lt;a href="http://www.bizoneers.com/"&gt;Bizo dev team&lt;/a&gt; will be attending the &lt;a href="http://www.10gen.com/conferences/mongosf2011"&gt;MongoSF conference&lt;/a&gt;.  Hope to see you there.  We're also sponsoring the after-party at &lt;a href="http://maps.google.com/maps/place?hl=en&amp;ie=UTF8&amp;q=oz+lounge+sf&amp;fb=1&amp;gl=us&amp;hq=oz+lounge&amp;hnear=0x80859a6d00690021:0x4a501367f076adff,San+Francisco,+CA&amp;cid=9337283732821941052&amp;z=14"&gt;Oz Lounge&lt;/a&gt;.  
Stop by, say "hello," and let us buy you a drink!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-7079889799673640888?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/7079889799673640888/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=7079889799673640888' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7079889799673640888'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7079889799673640888'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2011/05/bizo-mongosf.html' title='Bizo @ MongoSF'/><author><name>larry ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://lh6.ggpht.com/_N7Nsdm8uf9k/Tdmgw47xfaI/AAAAAAAAACc/pgGXoPtnhlw/s72-c/mongoSF_badge_210x140.png?imgmax=800' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-3625862707796762324</id><published>2011-05-19T11:44:00.001-07:00</published><updated>2011-05-19T11:44:32.023-07:00</updated><title type='text'>Cloudwatch custom metrics @ Bizo</title><content type='html'>Now that &lt;a href="http://aws.amazon.com/about-aws/whats-new/2011/05/10/amazon-cloudwatch-announces-custom-metrics-lower-prices-for-amazon-ec2-monitoring/"&gt;Cloudwatch Custom Metrics&lt;/a&gt; are live, I wanted to talk a bit about how we're using them here at Bizo. We've been 
heavy users of the existing metrics to track requests/machine counts/latency, etc. as &lt;a href="http://developer.bizo.com/documentation/10-performance-and-availability"&gt;seen here&lt;/a&gt;.  We wanted to start tracking more detailed application-specific metrics and were excited to learn about the beta custom metric support.&lt;h2&gt;Error Tracking&lt;/h2&gt;The first thing we decided to tackle was tracking application errors.  We were able to do this across our applications pretty much transparently by creating a custom &lt;a href="http://download.oracle.com/javase/6/docs/api/java/util/logging/Handler.html"&gt;java.util.logging.Handler&lt;/a&gt;.  Any application log message that crosses the specified level (typically SEVERE, or WARNING) will be logged to cloudwatch.&lt;img src="http://lh3.ggpht.com/_N7Nsdm8uf9k/TdVlDYbwMKI/AAAAAAAAACQ/hTFneOJAcrw/api-errors.png?imgmax=800" alt="Api errors" title="api-errors.png" border="0" width="598" height="233" /&gt;For error metrics, we use "ErrorCount" as the metric name, with the following dimensions:&lt;ul&gt;&lt;li&gt;Version, e.g. 124021 (svn revision number)&lt;/li&gt;&lt;li&gt;Stage, e.g. prod&lt;/li&gt;&lt;li&gt;Region, e.g. us-west-1&lt;/li&gt;&lt;li&gt;Application, e.g. api-web&lt;/li&gt;&lt;li&gt;InstanceId, e.g. i-201345a&lt;/li&gt;&lt;li&gt;class, e.g. com.sun.jersey.server.impl.application.WebApplicationImpl&lt;/li&gt;&lt;li&gt;exception, e.g. 
com.bizo.util.sdb.RuntimeDBException&lt;/li&gt;&lt;/ul&gt;Each application has its own cloudwatch namespace. This setup allows us to track error counts and rates across our applications/versions/regions, as well as get alerts when they reach specific thresholds.&lt;h2&gt;Other Application Metrics&lt;/h2&gt;We expose a simple MetricTracker interface in our applications:&lt;pre class="prettyprint"&gt;&lt;br /&gt;interface MetricTracker {&lt;br /&gt;  void track(String metricName, Number value, List&amp;lt;Dimension&amp;gt; dimensions);&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;The implementation handles buffering/aggregating the metric data internally, and then periodically sends batches of cloudwatch updates. This allows developers to quickly add tracking for any metric they want.  Note that with cloudwatch, there's no setup required; you just start tracking.&lt;h2&gt;Wishlist&lt;/h2&gt;It's incredibly easy to get up and running with cloudwatch, but it's not perfect.  There are a couple of things I'd like to see:&lt;ul&gt;&lt;li&gt;More data - CW only stores 2 weeks of data, which seems too short.&lt;/li&gt;&lt;li&gt;Faster - pulling data from CW (both command line and UI) can be really slow.&lt;/li&gt;&lt;li&gt;Better support for multiple dimensions / drill down.&lt;/li&gt;&lt;/ul&gt;Cloudwatch does allow you to track against multiple dimensions, but it doesn't work as you'd probably expect.  They're really treated as a single dimension.  E.g. If you track against &lt;code&gt;stage=prod,version=123&lt;/code&gt;, you can ONLY retrieve stats by querying against &lt;code&gt;stage=prod,version=123&lt;/code&gt;.  Querying against &lt;code&gt;stage=prod&lt;/code&gt; only or &lt;code&gt;version=123&lt;/code&gt; only will not produce any results. You can work around this in your application, by submitting data for all permutations that you want to track against (our MetricTracker implementation works this way).  
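&lt;p&gt;As a rough illustration of that workaround (a sketch only, not our actual MetricTracker code; the &lt;code&gt;DimensionSubsets&lt;/code&gt; class and &lt;code&gt;nonEmptySubsets&lt;/code&gt; method are names made up for this example), the following enumerates every non-empty combination of a small set of dimensions.  Each combination would then be submitted to cloudwatch as its own metric update:&lt;/p&gt;
&lt;pre class="prettyprint"&gt;
```java
// Hypothetical helper for illustration; not part of any AWS or Bizo library.
class DimensionSubsets {

  // Returns every non-empty combination of the given "name=value" dimensions,
  // each rendered as a comma-separated string. Assumes a small number of
  // dimensions, since the result has 2^n - 1 entries.
  static String[] nonEmptySubsets(String[] dims) {
    int total = 1;
    for (String ignored : dims) {
      total *= 2; // after the loop, total is 2^n
    }
    String[] out = new String[total - 1];
    // Each mask value selects one combination: bit i set means dims[i] is included.
    for (int mask = 1; mask != total; mask++) {
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i != dims.length; i++) {
        if ((mask >> i) % 2 == 1) {
          if (sb.length() != 0) {
            sb.append(',');
          }
          sb.append(dims[i]);
        }
      }
      out[mask - 1] = sb.toString();
    }
    return out;
  }

  public static void main(String[] args) {
    String[] dims = {"stage=prod", "version=123"};
    for (String combination : nonEmptySubsets(dims)) {
      System.out.println(combination);
    }
  }
}
```
&lt;/pre&gt;
&lt;p&gt;For &lt;code&gt;stage=prod&lt;/code&gt; and &lt;code&gt;version=123&lt;/code&gt; this yields three combinations: each dimension on its own, plus the pair.  The count roughly doubles with every dimension you add, so the number of cloudwatch updates grows quickly.&lt;/p&gt;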
It would be great if cloudwatch supported this more fully, including being able to drill down/up in the UI.&lt;h2&gt;Alternatives&lt;/h2&gt;We didn't invest too much time into exploring alternatives.  It seems like running an &lt;a href="http://opentsdb.net/"&gt;OpenTSDB&lt;/a&gt; cluster, or something like &lt;a href="https://github.com/etsy/statsd"&gt;statsd&lt;/a&gt; would get you pretty far in terms of metric collection.  That's only part of the story though; you would also definitely want alerting, and possibly service scaling based on your metrics.&lt;h2&gt;Overall Impressions&lt;/h2&gt;We continue to be excited about the custom metric support in Cloudwatch.  We were able to get up and running very quickly with useful reports and alarms based on our own application metrics.  For us, the clear advantage is that there's absolutely no setup, management or maintenance involved.  Additionally, the full integration into alarms, triggers, and the AWS console is key.&lt;h2&gt;Future Use&lt;/h2&gt;We think that we may be able to get more efficient machine usage by triggering scaling events based on application metric data, so this is something we will continue to explore.  It's easy to see how the error tracking we are doing can be integrated into a deployment system to allow for more automated rollout/rollback by tracking error rate changes based on version, so I definitely see us heading in that direction.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-3625862707796762324?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/3625862707796762324/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=3625862707796762324' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/5261056907132640554/posts/default/3625862707796762324'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/3625862707796762324'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2011/05/cloudwatch-custom-metrics-bizo.html' title='Cloudwatch custom metrics @ Bizo'/><author><name>larry ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://lh3.ggpht.com/_N7Nsdm8uf9k/TdVlDYbwMKI/AAAAAAAAACQ/hTFneOJAcrw/s72-c/api-errors.png?imgmax=800' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-797328810751032928</id><published>2011-05-16T09:43:00.000-07:00</published><updated>2011-05-16T13:40:05.907-07:00</updated><title type='text'>Synchronizing Stashboard with Pingdom alerts</title><content type='html'>&lt;div style="text-align: left;"&gt;First, what's &lt;a href="http://www.stashboard.org/"&gt;Stashboard&lt;/a&gt;? It's an open-source status page for cloud services and APIs.  
Here's a basic example:&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-n9qswoKTyr4/TdGAfs_0XCI/AAAAAAAAABI/wYwXSzqy05Y/s1600/stashboard1.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;img alt="" border="0" id="BLOGGER_PHOTO_ID_5607404293196110882" src="http://2.bp.blogspot.com/-n9qswoKTyr4/TdGAfs_0XCI/AAAAAAAAABI/wYwXSzqy05Y/s320/stashboard1.png" style="cursor: pointer; display: block; height: 110px; margin-bottom: 10px; margin-left: auto; margin-right: auto; margin-top: 0px; text-align: center; width: 320px;" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;Alright, now what's &lt;a href="http://www.pingdom.com/"&gt;Pingdom&lt;/a&gt;?  It's a commercial service for monitoring cloud services and APIs.  You define how to "ping" a service, and Pingdom periodically checks if the service is responding to the ping request and if not, sends email or SMS alerts.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-q_S0W6NGRL4/TdGBBa1n64I/AAAAAAAAABQ/OWME4b7eTXk/s1600/pingdom1.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;img alt="" border="0" id="BLOGGER_PHOTO_ID_5607404872437066626" src="http://4.bp.blogspot.com/-q_S0W6NGRL4/TdGBBa1n64I/AAAAAAAAABQ/OWME4b7eTXk/s320/pingdom1.png" style="cursor: pointer; display: block; height: 248px; margin-bottom: 10px; margin-left: auto; margin-right: auto; margin-top: 0px; text-align: center; width: 320px;" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;See the connection?  
At Bizo, we've had Stashboard deployed on Google's AppEngine for a while but we were updating the status of services manually -- only when major outages happened.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Recently, we've been wanting something more automated, so we decided to synchronize Stashboard status with Pingdom's notification history and came up with the following requirements:&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;Synchronize Stashboard within 15 minutes of Pingdom's alert.&lt;/li&gt;&lt;li&gt;"Roll-up" several Pingdom alerts into a single Stashboard status (i.e., for a given service, we have several Pingdom alerts covering different regions around the world but we only want to show a single service status in Stashboard)&lt;/li&gt;&lt;li&gt;If any of the related Pingdom alerts indicate a service is currently unavailable, show "&lt;i&gt;Service is currently down&lt;/i&gt;" status.&lt;/li&gt;&lt;li&gt;If the service is available but there have been any alerts in the past 24 hours, show "&lt;i&gt;Service is experiencing intermittent problems&lt;/i&gt;" status.&lt;/li&gt;&lt;li&gt;Otherwise, display "&lt;i&gt;Service is up&lt;/i&gt;" status.&lt;/li&gt;&lt;/ol&gt;&lt;div&gt;There are several ways we could have implemented this.  We initially thought about using AppEngine's &lt;a href="http://code.google.com/appengine/docs/python/mail/"&gt;Python Mail API&lt;/a&gt; but decided against it since we're not familiar enough with Python and we didn't want to customize Stashboard from the inside.  We ended up doing an integration "from the outside" using a cron job and a Ruby script that uses the &lt;a href="https://github.com/smulube/stashboard-ruby"&gt;stashboard&lt;/a&gt; and the &lt;a href="https://github.com/mtodd/pingdom-client"&gt;pingdom-client&lt;/a&gt; gems.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It was actually pretty simple.   
To connect to both services,&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;pre class="brush:ruby"&gt;require 'pingdom-client'&lt;br /&gt;require 'stashboard'&lt;br /&gt;&lt;br /&gt;pingdom = Pingdom::Client.new pingdom_auth.merge(:logger =&amp;gt; logger)&lt;br /&gt;&lt;br /&gt;stashboard = Stashboard::Stashboard.new(&lt;br /&gt;  stashboard_auth[:url],&lt;br /&gt;  stashboard_auth[:oauth_token],&lt;br /&gt;  stashboard_auth[:oauth_secret]&lt;br /&gt;)&lt;br /&gt;&lt;/pre&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;then define the mappings between our Pingdom alerts and Stashboard services using a hash of regular expressions,&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;pre class="brush:ruby"&gt;# Stashboard service id =&amp;gt; Regex matching pingdom check name(s)&lt;br /&gt;services = {&lt;br /&gt;  'api' =&amp;gt; /api/i,&lt;br /&gt;  'analyze' =&amp;gt; /analyze/i,&lt;br /&gt;  'self-service' =&amp;gt; /bizads/i,&lt;br /&gt;  'data-collector' =&amp;gt; /data collector/i&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;div&gt;and iterate over all Pingdom alerts and for each mapping determine if the service is either up or has had alerts in the past 24 hours,&lt;/div&gt;&lt;br /&gt;&lt;pre class="brush:ruby"&gt;# work on a copy, so deletions below don't alter the services mapping&lt;br /&gt;up_services = services.dup&lt;br /&gt;warning_services = {}&lt;br /&gt;&lt;br /&gt;# Synchronize recent pingdom outages over to stashboard&lt;br /&gt;# and determine which services are currently up.&lt;br /&gt;pingdom.checks.each do |check|&lt;br /&gt;  service = services.keys.find do |id|&lt;br /&gt;    regex = services[id]&lt;br /&gt;    check.name =~ regex&lt;br /&gt;  end&lt;br /&gt;  next unless service&lt;br /&gt;  &lt;br /&gt;  # check if any outages in past 24 hours&lt;br /&gt;  yesterday = Time.now - (24 * 60 * 60) # 24 hours, in seconds&lt;br /&gt;  recent_outages = check.summary.outages.select do |outage|&lt;br /&gt;    outage.timefrom &amp;gt; yesterday || outage.timeto &amp;gt; yesterday&lt;br /&gt;  end&lt;br /&gt;  
&lt;br /&gt;  # synchronize outage if necessary&lt;br /&gt;  recent_events = stashboard.events(service, "start" =&amp;gt; yesterday.strftime("%Y-%m-%d"))&lt;br /&gt;  recent_outages.each do |outage|&lt;br /&gt;    msg = "Service #{check.name} unavailable: " +&lt;br /&gt;    "#{outage.timefrom.strftime(TIME_FORMAT)} - #{outage.timeto.strftime(TIME_FORMAT)}"&lt;br /&gt;    unless recent_events.any? { |event| event["message"] == msg }&lt;br /&gt;      stashboard.create_event(service, "down", msg)&lt;br /&gt;    end&lt;br /&gt;  end&lt;br /&gt;  &lt;br /&gt;  # if service has recent outages, display warning&lt;br /&gt;  unless recent_outages.empty?&lt;br /&gt;    up_services.delete(service)&lt;br /&gt;    warning_services[service] = true&lt;br /&gt;  end&lt;br /&gt;&lt;br /&gt;  # if any pingdom check fails for a given service, consider the service down.&lt;br /&gt;  up_services.delete(service) if check.status == "down"&lt;br /&gt;end&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;div&gt;Lastly, if any services are up or should indicate a warning then we update their status accordingly,&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;pre class="brush:ruby"&gt;up_services.each_key do |service|&lt;br /&gt;  current = stashboard.current_event(service)&lt;br /&gt;  if current["message"] =~ /(Service .* unavailable)|(Service operational but has experienced outage)/i&lt;br /&gt;    stashboard.create_event(service, "up", "Service operating normally.")&lt;br /&gt;  end&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;warning_services.each_key do |service|&lt;br /&gt;  current = stashboard.current_event(service)&lt;br /&gt;  if current["message"] =~ /Service .* unavailable/i&lt;br /&gt;    stashboard.create_event(service, "warning", "Service operational but has experienced outage(s) in past 24 hours.")&lt;br /&gt;  end&lt;br /&gt;end&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;div&gt;Note that any manually-entered Stashboard status messages &lt;b&gt;will not&lt;/b&gt; be changed unless they match any of the 
automated messages or if there is a new outage reported by Pingdom.  This is intentional to allow overriding automated updates if for any reason, some kind of failure isn't accurately reported.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Curious about what the end result looks like?  Take a look at &lt;a href="http://status.bizo.com/"&gt;Bizo's status dashboard&lt;/a&gt;.&lt;/div&gt;&lt;div&gt;&lt;a href="http://4.bp.blogspot.com/-RJpRbZjFd3o/TdF9FvzStFI/AAAAAAAAABA/U4wf8KCq9Us/s1600/stashboard2.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;img alt="" border="0" id="BLOGGER_PHOTO_ID_5607400548737397842" src="http://4.bp.blogspot.com/-RJpRbZjFd3o/TdF9FvzStFI/AAAAAAAAABA/U4wf8KCq9Us/s320/stashboard2.png" style="cursor: pointer; display: block; height: 189px; margin-bottom: 10px; margin-left: auto; margin-right: auto; margin-top: 0px; text-align: center; width: 320px;" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;When you click on a specific service, you can see individual outages,&lt;/div&gt;&lt;div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-d6EWYZaY4RI/TdGLf6IwDxI/AAAAAAAAABY/Z2WFP4Fnm8E/s1600/stashboard3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="219" src="http://4.bp.blogspot.com/-d6EWYZaY4RI/TdGLf6IwDxI/AAAAAAAAABY/Z2WFP4Fnm8E/s320/stashboard3.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We hope this is useful to somebody out there... and big thanks to the Stashboard authors at Twilio, Matt Todd for creating the pingdom-client gem and Sam Mulube for the stashboard gem.  
You guys rule!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;PS: You can download the full Ruby script from &lt;a href="https://gist.github.com/975141"&gt;https://gist.github.com/975141&lt;/a&gt;.&lt;/div&gt;&lt;br /&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: helvetica, arial, freesans, clean, sans-serif; font-size: 14px; line-height: 20px;"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-797328810751032928?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/797328810751032928/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=797328810751032928' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/797328810751032928'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/797328810751032928'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2011/05/synchronizing-stashboard-with-pingdom.html' title='Synchronizing Stashboard with Pingdom alerts'/><author><name>Alex Boisvert</name><uri>http://www.blogger.com/profile/05164682765137205886</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-n9qswoKTyr4/TdGAfs_0XCI/AAAAAAAAABI/wYwXSzqy05Y/s72-c/stashboard1.png' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-6822468239273199684</id><published>2011-04-21T12:52:00.000-07:00</published><updated>2011-04-22T16:54:33.402-07:00</updated><title type='text'>How Bizo survived the Great AWS Outage of 2011 relatively unscathed...</title><content type='html'>The &lt;a href="http://twitter.com/#!/search/AWS%20outage"&gt;twittersphere&lt;/a&gt;, &lt;a href="http://newenterprise.allthingsd.com/20110421/amazons-cloud-crashed-overnight-and-brought-several-other-companies-down-too/"&gt;techblogs&lt;/a&gt; and even some &lt;a href="http://money.cnn.com/2011/04/21/technology/amazon_server_outage/"&gt;business sites&lt;/a&gt; are abuzz with the news that the US East Region of AWS has been experiencing a major outage.  This outage has taken down some of the best-known names on the web.  Bizo's infrastructure is 100% AWS and we support 1000s of publisher sites (including some very well-known business sites) doing billions of impressions a month.  Sure, we had a few bruises early yesterday morning when the outage first began, but since shortly after that we've been operating our core, high-volume services on top of AWS without the East region.&lt;br /&gt;&lt;p&gt;&lt;br /&gt;Here is how we have remained up despite not having a single ops person on our engineering team:&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;br /&gt;1) Our services are well monitored&lt;br /&gt;We rely on &lt;a href="http://pingdom.com/"&gt;pingdom&lt;/a&gt; for external verification of site availability on a worldwide basis.  Additionally, we have our own internal alarms and dashboards that give us up-to-the-minute metrics such as request rate, cpu utilization, etc.  Most of this data comes from AWS Cloudwatch monitoring, but we also track error rates and have alarms set up to alert us when these rates change or go over a certain threshold. 
&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;br /&gt;2) Our services have circuit breakers between remote services that trip when other services become unavailable and we heavily cache data&lt;br /&gt;When building our services, we always assume that remote services will fail at some point.  We've invested a good deal of time in minimizing the domino effect of a failing remote service.  When a remote service becomes unavailable, the caller detects this and goes into tripped mode, occasionally retrying with backoff.  Of course we also rely heavily on caching read-only data and are able to take advantage of the fact that the data needed for most of our services does not change very often. &lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;br /&gt;3) We utilize autoscaling&lt;br /&gt;One of the promises of AWS is the ability to start and stop more servers based on traffic and load.  We've been using autoscaling since it was launched and it worked like a charm.  You can see the instances starting up based on the new load in the US West region as traffic was diverted over from US East.&lt;br /&gt;&lt;br /&gt;&lt;img src="https://com-bizo-public.s3.amazonaws.com/blog/dc-instances.jpg" style="display:block;"/&gt;&lt;br /&gt;&lt;span style="font-size: 10px; font-weight:bold"&gt;(all times UTC)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;br /&gt;4) Our architecture is designed to let us funnel traffic around an entire region if necessary&lt;br /&gt;We utilize &lt;a href="http://dev.bizo.com/2010/05/improving-global-application.html"&gt;Global Load Balancing&lt;/a&gt; to direct traffic to the closest region based on the end-user's location.  For instance, if a user is in California, we direct their traffic to the US West region.  This was extremely valuable in keeping us fully functioning in the face of a regional outage.  When we finally decided that the US East region was going to cause major issues, switching all traffic to US West was as easy as clicking a few buttons.  
You can see how the requests transitioned over quickly after we made the decision.  (By the way, quick shout-out to Dynect who is our GSLB service provider.  Thanks!)&lt;br /&gt;&lt;br /&gt;&lt;img src="https://com-bizo-public.s3.amazonaws.com/blog/dc-requests.jpg" style="display:block;"/&gt;&lt;br /&gt;&lt;span style="font-size: 10px;  font-weight:bold"&gt;(all times UTC)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;br /&gt;Bumps and Bruises&lt;br /&gt;Of course we didn't escape without sustaining some issues.  We'll do another blog post on some of the issues we did run into but they were relatively minor.&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;br /&gt;Conclusion&lt;br /&gt;After 3 years running full time on AWS across 4 regions and 8 availability zones we design our systems with the assumption that failure will happen and it helped us come through this outage relatively unscathed.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-6822468239273199684?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/6822468239273199684/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=6822468239273199684' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/6822468239273199684'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/6822468239273199684'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2011/04/how-bizo-survived-great-aws-outage-of.html' title='How Bizo survived the Great AWS Outage of 2011 relatively 
unscathed...'/><author><name>Donnie</name><uri>http://www.blogger.com/profile/13599133732419522440</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-6156770367087625710</id><published>2011-04-18T09:44:00.000-07:00</published><updated>2011-04-18T09:44:03.714-07:00</updated><title type='text'>Command Query Responsibility Segregation with S3 and JSON</title><content type='html'>&lt;p&gt;We recently tackled a problem at &lt;a href='http://www.bizo.com'&gt;Bizo&lt;/a&gt; where we wanted to decouple our high-volume servers from our MySQL database.&lt;/p&gt;&lt;p&gt;While considering different options (NoSQL vs. MySQL, etc.), in retrospect we ended up implementing a SOA-version of the &lt;a href='http://en.wikipedia.org/wiki/Command-query_separation'&gt;Command Query Separation&lt;/a&gt; pattern (or &lt;a href='http://codebetter.com/gregyoung/2009/08/13/command-query-separation/'&gt;Command Query Responsibility Segregation&lt;/a&gt;, which is services/messaging-specific).&lt;/p&gt;&lt;p&gt;Briefly, in our new approach, queries (reads) use an in-memory cache that is bulk loaded and periodically reloaded from a snapshot of the data stored as JSON in S3. Commands (writes) are HTTP calls to a remote JSON API service. 
MySQL is still the authoritative database; we just added a layer of decoupling for both reads and writes.&lt;/p&gt;&lt;p&gt;This meant our high-volume servers now have:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;No reliance on MySQL availability or schema&lt;/li&gt;&lt;li&gt;No wire calls blocking the request thread (except a few special requests)&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The rest of this post explains our context and elaborates on the approach.&lt;/p&gt;&lt;h3 id='prior_approach_cached_jpa_calls'&gt;Prior Approach: Cached JPA Calls&lt;/h3&gt;&lt;p&gt;For context, our high-volume servers rely on configuration data that is stored in a MySQL database. Of course, the configuration data doesn&amp;#8217;t have to be absolutely fresh, so we&amp;#8217;d already been using caching to avoid constantly pounding the database for data that rarely changes.&lt;/p&gt;&lt;img src='http://draconianoverlord.com/images/remote-api-before.png' style='width: 400px; margin: auto; display: block;' /&gt;&lt;p&gt;There were several things we liked about this approach:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;We use &lt;a href='http://aws.amazon.com/rds/'&gt;Amazon RDS&lt;/a&gt; for the MySQL instance, which provides out-of-the-box backups, master/slave configuration, etc., and is generally a pleasure to use. We enjoy not running our own database servers.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;We also have several low-volume internal and customer-facing web applications that maintain the same data and are perfectly happy talking to a SQL database. 
They are normal, chatty CRUD applications for which the tool support and ACID-sanity of a SQL database make life a lot easier.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;That being said, we wanted to tweak a few things:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;Reduce the high-volume servers&amp;#8217; reliance on MySQL for seeding their cache.&lt;/p&gt;&lt;p&gt;Although RDS is great, and definitely more stable than our own self-maintained instances would be, there are nonetheless limits on its capacity. In particular, if one of our other applications misbehaves (which has never happened&amp;#8230;&lt;em&gt;cough&lt;/em&gt;), it can degrade the MySQL instance to the point of negatively affecting the high-volume servers.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Reduce cache misses that block the request thread.&lt;/p&gt;&lt;p&gt;Previously, configuration data (keyed by a per-request configuration id) was not pulled into cache until it was needed. The first request (after every cache flush) would reload the data for its configuration id from MySQL and repopulate the cache.&lt;/p&gt;&lt;p&gt;This was not initially a big deal, but as Bizo has grown we&amp;#8217;re now running in multiple AWS regions, and cache misses require a cross-region JDBC call to fetch their data from the MySQL server running in us-east.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;In very simplified code, our approach had been:&lt;/p&gt;&lt;pre class='brush:java'&gt;class TheServlet {&lt;br /&gt;  public void doGet() {&lt;br /&gt;    int configId = request.getParameter(&amp;quot;configId&amp;quot;);&lt;br /&gt;    Config config = configService.getConfig(configId);&lt;br /&gt;    // continue processing with config settings&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;class ConfigService {&lt;br /&gt;  // actually thread-safe/ehcache-managed, flushed every 30m&lt;br /&gt;  Map&amp;lt;Integer, Config&amp;gt; cached = new HashMap&amp;lt;Integer, Config&amp;gt;();&lt;br /&gt;&lt;br /&gt;  
public Config getConfig(int configId) {&lt;br /&gt;    Config config = cached.get(configId);&lt;br /&gt;    if (config == null) {&lt;br /&gt;      // hit mysql for the data, blocks the request thread&lt;br /&gt;      config = configJpaRepository.find(configId);&lt;br /&gt;      // cache it&lt;br /&gt;      cached.put(configId, config);&lt;br /&gt;    }&lt;br /&gt;    return config;&lt;br /&gt;  }&lt;br /&gt;}&lt;/pre&gt;&lt;h3 id='potential_big_data_approaches'&gt;Potential &amp;#8220;Big Data&amp;#8221; Approaches&lt;/h3&gt;&lt;p&gt;Given our primary concern was MySQL being a single point of failure, we considered moving to a new database platform, e.g. SimpleDB, Cassandra, or the like, all of which can scale out across machines.&lt;/p&gt;&lt;p&gt;Of course, RDS&amp;#8217;s master/slave MySQL setup already reduces its risk of single machine point of failure, but the RDS master/slave cluster as a whole is still, using the term loosely, a &amp;#8220;single point&amp;#8221;. Granted, with this very loose definition, there will always be some &amp;#8220;point&amp;#8221; you rely on&amp;#8211;we just wanted one that we felt more comfortable with than MySQL.&lt;/p&gt;&lt;p&gt;Anyway, for NoSQL options, we couldn&amp;#8217;t get over the cons of:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;Having to run our own clusters (except for SimpleDB).&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Having to migrate our low-volume CRUD webapps over to the new, potentially slow (SimpleDB), potentially eventually-consistent (Cassandra) NoSQL back-end.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Still having cache misses result in request threads blocking on wire calls.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Because of these cons, we did not put a lot of effort into researching NoSQL approaches for this problem&amp;#8211;we felt it was fairly apparent they weren&amp;#8217;t necessary.&lt;/p&gt;&lt;h3 id='realization_mysql_is_fine_fix_the_cache'&gt;Realization: MySQL is Fine, Fix the 
Cache&lt;/h3&gt;&lt;p&gt;Of course, we really didn&amp;#8217;t have a Big Data problem (well, we do &lt;a href='http://hadoop.apache.org/'&gt;have&lt;/a&gt; &lt;a href='http://wiki.apache.org/hadoop/Hive'&gt;a&lt;/a&gt; &lt;a href='http://aws.amazon.com/elasticmapreduce/'&gt;lot&lt;/a&gt; of &lt;a href='http://en.wikipedia.org/wiki/OLAP_cube'&gt;those&lt;/a&gt;, but not for this problem).&lt;/p&gt;&lt;p&gt;We just had a cache seeding problem. Specifically:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;All of our configuration data can fit in RAM, so we should be able to bulk-load all of it at once&amp;#8211;no more expensive, blocking wire calls on cache misses (basically there are no cache misses anymore).&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;We can load the data from a more reliable, non-authoritative, non-MySQL data store&amp;#8211;e.g. an S3 snapshot (&lt;code&gt;config.json.gz&lt;/code&gt;) of the configuration data.&lt;/p&gt;&lt;p&gt;The S3 file then basically becomes our alternative &amp;#8220;query&amp;#8221; database in the CQRS pattern.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;When these are put together, a solution emerges where we can have an in-memory, always-populated cache of the configuration data that is refreshed by a background thread and results in request threads never blocking.&lt;/p&gt;&lt;p&gt;In code, this looks like:&lt;/p&gt;&lt;pre class='brush:java'&gt;class TheServlet {&lt;br /&gt;  public void doGet() {&lt;br /&gt;    // note: no changes from before, which made migrating easy&lt;br /&gt;    int configId = request.getParameter(&amp;quot;configId&amp;quot;);&lt;br /&gt;    Config config = configService.getConfig(configId);&lt;br /&gt;    // continue processing with config settings&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;class ConfigService {&lt;br /&gt;  // the current cache of all of the config data&lt;br /&gt;  AtomicReference&amp;lt;Map&amp;gt; cached = new AtomicReference();&lt;br /&gt;&lt;br /&gt;  public void init() {&lt;br 
/&gt;    // use java.util.Timer to refresh the cache&lt;br /&gt;    // on a background thread&lt;br /&gt;    new Timer(true).schedule(new TimerTask() {&lt;br /&gt;      public void run() {&lt;br /&gt;        Map newCache = reloadFromS3(&amp;quot;bucket/config.json.gz&amp;quot;);&lt;br /&gt;        cached.set(newCache);&lt;br /&gt;      }&lt;br /&gt;    }, 0, TimeUnit.MINUTES.toMillis(30));&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  public Config getConfig(int configId) {&lt;br /&gt;    // now always return whatever is in the cache--if a&lt;br /&gt;    // configId isn&amp;#39;t present, that means it was not in&lt;br /&gt;    // the last S3 file and is treated the same as it&lt;br /&gt;    // not being in the MySQL database previously&lt;br /&gt;    Map currentCache = cached.get();&lt;br /&gt;    if (currentCache == null) {&lt;br /&gt;      return null; // data hasn&amp;#39;t been loaded yet&lt;br /&gt;    } else {&lt;br /&gt;      return currentCache.get(configId);&lt;br /&gt;    }&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  private Map reloadFromS3(String path) {&lt;br /&gt;    // uses AWS SDK to load the data from S3&lt;br /&gt;    // and Jackson to deserialize it to a map&lt;br /&gt;  }&lt;br /&gt;}&lt;/pre&gt;&lt;h3 id='a_few_wrinkles_realtime_reads_and_writes'&gt;A Few Wrinkles: Real-Time Reads and Writes&lt;/h3&gt;&lt;p&gt;So far I&amp;#8217;ve only talked about the cached query/reads side of the new approach. We also had two more requirements:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;Very (very) infrequently, a high-volume server will need real-time configuration data to handle a special request.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;The high-volume servers occasionally write configuration/usage stats back to the MySQL database.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;While we could have continued using a MySQL/JDBC connection for these few requests, this also provided the opportunity to build a JSON API in front of the MySQL database. 
This was desirable for two main reasons:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;It decoupled our high-volume services from our MySQL schema. By still honoring the JSON API, we could upgrade the MySQL schema and the JSON API server at the same time with a much smaller, much less complicated downtime window than with the high-volume services talking directly to the MySQL schema.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;The MySQL instance is no longer being accessed across AWS regions, so it can have much tighter firewall rules, which only allow the JSON API server (which is within the same us-east region) to access it.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The new setup looks basically like:&lt;/p&gt;&lt;img src='http://draconianoverlord.com/images/remote-api-after.png' style='width: 400px; margin: auto; display: block;' /&gt;&lt;h3 id='scalatra_servlet_example'&gt;Scalatra Servlet Example&lt;/h3&gt;&lt;p&gt;With &lt;a href='http://jackson.codehaus.org/'&gt;Jackson&lt;/a&gt; and &lt;a href='https://github.com/scalatra/scalatra'&gt;Scalatra&lt;/a&gt;, the JSON API server was trivial to build, especially since it could reuse the same JSON DTO objects that are also serialized in the &lt;code&gt;config.json.gz&lt;/code&gt; file in S3.&lt;/p&gt;&lt;p&gt;As an example of how simple Jackson and Scalatra made writing the JSON API, here is the code for serving real-time requests:&lt;/p&gt;&lt;pre class='brush:scala'&gt;class JsonApiService extends ScalatraServlet {&lt;br /&gt;  get(&amp;quot;/getConfig&amp;quot;) {&lt;br /&gt;    // config is the domain object fresh from MySQL&lt;br /&gt;    val config = configRepo.find(params(&amp;quot;configId&amp;quot;).toLong)&lt;br /&gt;    // configDto is just the data we want to serialize&lt;br /&gt;    val configDto = ConfigMapper.toDto(config)&lt;br /&gt;    // jackson magic to make json&lt;br /&gt;    val json = jackson.writeValueAsString(configDto)&lt;br /&gt;    json&lt;br /&gt;  }&lt;br /&gt;}&lt;/pre&gt;&lt;h3 
id='background_writes'&gt;Background Writes&lt;/h3&gt;&lt;p&gt;The final optimization came from realizing that, for our requirements, the stats writes that some requests on the high-volume servers trigger aren&amp;#8217;t critical.&lt;/p&gt;&lt;p&gt;This means there is no need to perform them on the request-serving thread. Instead, we can push the writes onto a queue and have them fulfilled by a background thread.&lt;/p&gt;&lt;p&gt;This generally looks like:&lt;/p&gt;&lt;pre class='brush:java'&gt;class ConfigWriteService {&lt;br /&gt;  // create a background thread pool of (for now) size 1&lt;br /&gt;  private ExecutorService executor = new ThreadPoolExecutor(...);&lt;br /&gt;&lt;br /&gt;  // called by the request thread, won&amp;#39;t block&lt;br /&gt;  public void writeUsage(int configId, int usage) {&lt;br /&gt;    offer(&amp;quot;https://json-api-service/writeUsage?configId=&amp;quot; +&lt;br /&gt;      configId +&lt;br /&gt;      &amp;quot;&amp;amp;usage=&amp;quot; +&lt;br /&gt;      usage);&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  private void offer(String url) {&lt;br /&gt;    try {&lt;br /&gt;      executor.submit(new BackgroundWrite(url));&lt;br /&gt;    } catch (RejectedExecutionException ree) {&lt;br /&gt;      // queue full, writes aren&amp;#39;t critical, so ignore&lt;br /&gt;    }&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  private static class BackgroundWrite implements Runnable {&lt;br /&gt;    private String url;&lt;br /&gt;&lt;br /&gt;    private BackgroundWrite(String url) {&lt;br /&gt;      this.url = url;&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    public void run() {&lt;br /&gt;      // make call using commons-http to url&lt;br /&gt;    }&lt;br /&gt;  }&lt;br /&gt;}&lt;/pre&gt;&lt;h3 id='tldr_we_implemented_command_query_responsibility_segregation'&gt;tl;dr We Implemented Command Query Responsibility Segregation&lt;/h3&gt;&lt;p&gt;By changing only a minimal amount of code in our high-volume servers, we were 
able to:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;Have queries (most reads) use cached, always-loaded data that is periodically reloaded from data snapshots in S3 (a more reliable source than MySQL).&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Have commands (writes) sent from a background thread to a JSON API that saves the data to MySQL and hides MySQL schema changes.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;For this configuration data, and our current requirements, MySQL, augmented with a more aggressive, Command Query Separation-style caching scheme, has worked well and continues to do so.&lt;/p&gt;&lt;p&gt;For more reading on CQS/CQRS, I suggest:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;Both the &lt;a href='http://en.wikipedia.org/wiki/Command-query_separation'&gt;Wikipedia&lt;/a&gt; article and Martin Fowler&amp;#8217;s &lt;a href='http://www.martinfowler.com/bliki/CommandQuerySeparation.html'&gt;CommandQuerySeparation&lt;/a&gt;, although both focus on CQS as applied to OO, e.g. side-effect-free vs. mutating method calls.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;For CQS applied to services, i.e. CQRS, &lt;a href='http://www.udidahan.com/2009/12/09/clarified-cqrs/'&gt;Udi Dahan&lt;/a&gt; seems to be one of the first advocates of the term. 
Since then, CQRS even seems to have its own &lt;a href='http://cqrsinfo.com/'&gt;site&lt;/a&gt; and &lt;a href='http://groups.google.com/group/dddcqrs'&gt;google group&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-6156770367087625710?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/6156770367087625710/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=6156770367087625710' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/6156770367087625710'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/6156770367087625710'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2011/04/command-query-responsibility.html' title='Command Query Responsibility Segregation with S3 and JSON'/><author><name>Stephen Haberman</name><uri>http://www.blogger.com/profile/05412274950722949930</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-5529269733211884395</id><published>2011-04-15T13:25:00.000-07:00</published><updated>2011-04-16T10:08:25.495-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='crowdflower'/><title type='text'>Crowdflower Trickery: dynamic tasks</title><content type='html'>Here at Bizo, we often use crowdflower to improve the quality of our data. In doing so, we’ve come across some cool but under-documented tricks. 
One trick that we’ve found particularly useful is using &lt;a href="https://github.com/tobi/liquid/wiki/liquid-for-designers"&gt;liquid for designers&lt;/a&gt; to dynamically generate crowdflower tasks. Let us take a look at how to do this with a toy example.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 0); background-color: transparent; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; font-family: Arial; "&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Problem: &lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;We are given a list of desserts from a bakery as shown in &lt;/span&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;i&gt;data.csv&lt;/i&gt;&lt;/span&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt; below. 
Our task is to determine the following:&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/span&gt;&lt;div style="background-color: transparent; "&gt;&lt;span style="color: rgb(0, 0, 0); background-color: transparent; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;1. If the dessert is a cake, is it appropriate for:&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent; font-family: 'Times New Roman'; font-size: medium; "&gt;&lt;ul&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: Arial; font-size: 15px; white-space: pre-wrap; "&gt;Wedding events&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: Arial; font-size: 15px; white-space: pre-wrap; "&gt;Birthday events&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: Arial; font-size: 15px; white-space: pre-wrap; "&gt;Casual eating&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="color: rgb(0, 0, 0); background-color: transparent; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;i&gt;&lt;b&gt;2. 
If the dessert is &lt;/b&gt;&lt;/i&gt;&lt;u&gt;&lt;i&gt;&lt;b&gt;NOT&lt;/b&gt;&lt;/i&gt;&lt;/u&gt;&lt;i&gt;&lt;b&gt; a cake, is it appropriate for:&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent; font-family: 'Times New Roman'; font-size: medium; "&gt;&lt;ul&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: Arial; font-size: 15px; white-space: pre-wrap; "&gt;eating using hands&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: Arial; font-size: 15px; white-space: pre-wrap; "&gt;eating using forks&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: Arial; font-size: 15px; white-space: pre-wrap; "&gt;eating using spoons&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div style="background-color: transparent; font-family: 'Times New Roman'; font-size: medium; "&gt;&lt;span class="Apple-style-span" style="font-family: 'Times New Roman'; "&gt;&lt;b&gt;file: data.csv&lt;/b&gt;&lt;/span&gt;&lt;code&gt;&lt;br /&gt;"id","name","desert_type"&lt;br /&gt;0,"Peanut Butter Cake","cake"&lt;br /&gt;1,"Strawberry Donut","donut"&lt;br /&gt;2,"Cookies and crème ice cream cake","cake"&lt;br /&gt;3,"Apple Strudle","strudle"&lt;br /&gt;4,"Chocolate Pie","pie"&lt;br /&gt;5,"Red Velvet Cake","cake"&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Initial Approach:&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: 
pre-wrap; "&gt;One approach is to create two separate jobs: one which will ask workers questions relevant to cakes and one which will ask the workers questions relevant to non-cake desserts. &lt;/span&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;To create these jobs we would first have to break up our data into two data files: one containing data only for cakes (see cake.csv) and one for everything else (see non-cake.csv).&lt;/span&gt;&lt;span style="color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; "&gt; &lt;/span&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;Then, we would have to specify the appropriate cml code for each job. 
For our example, the data and cml code will look like:&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent; font-family: 'Times New Roman'; font-size: medium; "&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="font-size: 15px; white-space: pre-wrap;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: center; background-color: transparent; font-family: 'Times New Roman'; "&gt;&lt;span class="Apple-style-span" style="white-space: pre-wrap; "&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Job 1: Cake desserts&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;b&gt;file: &lt;/b&gt;cake.csv&lt;code&gt;&lt;br /&gt;"id","name","desert_type"&lt;br /&gt;0,"Peanut Butter Cake","cake"&lt;br /&gt;2,"Cookies and crème ice cream cake","cake"&lt;br /&gt;5,"Red Velvet Cake","cake"&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;b&gt;file:&lt;/b&gt; cake.cml&lt;code&gt;&lt;br /&gt;Dessert: {{name}}&lt;br /&gt;&amp;lt;cml:checkboxes label="This dessert is appropriate for"&amp;gt;&lt;br /&gt;&amp;lt;cml:checkbox label="wedding events"/&amp;gt;&lt;br /&gt;&amp;lt;cml:checkbox label="birthday events"/&amp;gt;&lt;br /&gt;&amp;lt;cml:checkbox label="casual eating"/&amp;gt;&lt;br /&gt;&amp;lt;/cml:checkboxes&amp;gt;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;div&gt;&lt;div style="text-align: center; background-color: transparent; font-family: 'Times New Roman'; "&gt;&lt;span class="Apple-style-span" style="white-space: pre-wrap; "&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Job 2: Non-cake desserts&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;b&gt;&lt;/b&gt;&lt;b&gt;&lt;/b&gt;&lt;b&gt;&lt;/b&gt;&lt;b&gt;file: &lt;/b&gt;non-cake.csv&lt;code&gt;&lt;br /&gt;"id","name","desert_type"&lt;br /&gt;1,"Strawberry Donut","donut"&lt;br /&gt;3,"Apple Strudle","strudle"&lt;br /&gt;4,"Chocolate Pie","pie"&lt;br 
/&gt;&lt;/code&gt;&lt;br /&gt;&lt;b&gt;file: &lt;/b&gt;non-cake.cml&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: monospace; font-size: 13px; white-space: pre; "&gt;Dessert: {{name}}&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: monospace; font-size: 13px; white-space: pre; "&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: monospace; font-size: 13px; white-space: pre; "&gt;&amp;lt;cml:checkboxes label="This dessert is appropriate for"&amp;gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: monospace; font-size: 13px; white-space: pre; "&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: monospace; font-size: 13px; white-space: pre; "&gt;&amp;lt;cml:checkbox label="eating using hands"/&amp;gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: monospace; font-size: 13px; white-space: pre; "&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: monospace; font-size: 13px; white-space: pre; "&gt;&amp;lt;cml:checkbox label="eating using forks"/&amp;gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: monospace; font-size: 13px; white-space: pre; "&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: monospace; font-size: 13px; white-space: pre; "&gt;&amp;lt;cml:checkbox label="eating using spoons"/&amp;gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: monospace; font-size: 13px; white-space: pre; "&gt;&amp;lt;/cml:checkboxes&amp;gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;b&gt;Approach using Liquid:&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;Using liquid, we can solve the same problem with just &lt;b&gt;one job&lt;/b&gt; rather than two. 
To do so, we simply embed liquid logic tags into our cml code to dynamically display tasks to users -- the appropriate checkboxes will appear based on the "desert_type" field. Furthermore, we do &lt;b&gt;not&lt;/b&gt; need to break up the data.csv file.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;cml code&lt;/b&gt;&lt;/div&gt;&lt;code&gt;Dessert: {{name}}&lt;br /&gt;&amp;lt;cml:checkboxes label="This dessert is appropriate for"&amp;gt;&lt;br /&gt;{% if desert_type == 'cake' %}&lt;br /&gt;&amp;lt;cml:checkbox label="wedding events"/&amp;gt;&lt;br /&gt;&amp;lt;cml:checkbox label="birthday events"/&amp;gt;&lt;br /&gt;&amp;lt;cml:checkbox label="casual eating"/&amp;gt;&lt;br /&gt;{% else %}&lt;br /&gt;&amp;lt;cml:checkbox label="eating using hands"/&amp;gt;&lt;br /&gt;&amp;lt;cml:checkbox label="eating using forks"/&amp;gt;&lt;br /&gt;&amp;lt;cml:checkbox label="eating using spoons"/&amp;gt;&lt;br /&gt;{% endif %}&lt;br /&gt;&amp;lt;/cml:checkboxes&amp;gt;&lt;/code&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;code&gt;&lt;br /&gt;&lt;/code&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-5529269733211884395?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/5529269733211884395/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=5529269733211884395' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/5529269733211884395'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/5529269733211884395'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2011/04/crowdflower-trickery.html' title='Crowdflower Trickery: dynamic 
tasks'/><author><name>Tony</name><uri>http://www.blogger.com/profile/03721589851872226027</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-7793631849323213500</id><published>2011-04-14T19:39:00.000-07:00</published><updated>2011-04-14T19:39:59.688-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='unit testing'/><category scheme='http://www.blogger.com/atom/ns#' term='hive'/><title type='text'>Hive Unit Testing</title><content type='html'>&lt;b&gt;Introduction&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Hive has become an extremely important component in our overall software stack.  We have numerous ‘mission-critical’ reports that are generated using Hive and want to make sure we can apply our testing processes to Hive scripts in the same way that we apply them to other code artifacts.&lt;br /&gt;&lt;br /&gt;A few weeks ago, I was tasked with finding an approach for unit testing our Hive scripts. To my surprise, a Google search for ‘Hive Unit Testing’ yielded relatively few useful results.&lt;br /&gt;&lt;br /&gt;I wanted a solution that would allow us to test locally (vs. a solution that would require EMR).  Where possible, I prefer local testing because it’s simpler, provides more immediate feedback, and doesn’t require a network.&lt;br /&gt;&lt;br /&gt;After reading this post, you will (hopefully) know how to run Hive unit tests in your own environment.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The Approach&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;After performing some research, I decided on an approach that is part of the Hive project itself. 
&amp;nbsp;At a high level, the solution works in the following way:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Start up an instance of the Hive CLI&lt;/li&gt;&lt;li&gt;Execute a Hive script (positive or negative case)&lt;/li&gt;&lt;li&gt;Compare the script&amp;#39;s output (from the CLI) to an expected output file&lt;/li&gt;&lt;li&gt;Rinse and repeat&lt;/li&gt;&lt;/ul&gt;The rest of this post discusses the specific steps required to get this solution running in your own environment.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Set up Hive Locally&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The first step is to create some Ant targets for setting up Hive locally.  Here’s a snippet of Ant that shows how to do this:&lt;br /&gt;&lt;script class="brush:xml" type="syntaxhighlighter"&gt;&lt;![CDATA[  &lt;target name="hive.check"&gt;    &lt;available file="${hive.root}"  property="hive.available"/&gt;  &lt;/target&gt;  &lt;target name="hadoop.check"&gt;    &lt;available file="${hadoop.root}"  property="hadoop.available"/&gt;  &lt;/target&gt;  &lt;!-- install hadoop --&gt;  &lt;target name="hadoop.init" depends="retrieve, hadoop.check" unless="hadoop.available"&gt;    &lt;untar dest="${tools.dir}" compression="gzip"&gt;      &lt;fileset dir="${lib.dir}/buildtime"&gt;        &lt;include name="hadoop.tar.gz" /&gt;      &lt;/fileset&gt;    &lt;/untar&gt;    &lt;!-- make bin directory executable --&gt;    &lt;chmod perm="+x"&gt;      &lt;fileset dir="${tools.dir}"&gt;        &lt;include name="hadoop-*/bin/**" /&gt;      &lt;/fileset&gt;    &lt;/chmod&gt;  &lt;/target&gt;  &lt;!-- install hive --&gt;  &lt;target name="hive.init" depends="hadoop.init, retrieve, hive.check" unless="hive.available" description="Install hive into ${tools.dir}"&gt;    &lt;untar dest="${tools.dir}" compression="gzip"&gt;      &lt;fileset dir="${lib.dir}/buildtime"&gt;        &lt;include name="hive.tar.gz" /&gt;      &lt;/fileset&gt;    &lt;/untar&gt;    &lt;!-- make bin directory executable --&gt;    &lt;chmod perm="+x"&gt;      
&lt;fileset dir="${tools.dir}"&gt;        &lt;include name="hive-*/bin/**" /&gt;      &lt;/fileset&gt;    &lt;/chmod&gt;  &lt;/target&gt;]]&gt;&lt;/script&gt;&lt;br /&gt;&lt;br /&gt;You should now be able to execute ‘ant hive.init’ and have Hive available in the tools directory.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Generate test cases&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The developer is responsible for providing the .q Hive files that represent the test cases.  There is a code generation step that will create JUnit classes (one for positive test cases, one for negative test cases) given a set of .q files.  The Ant snippet below shows how to generate the test classes:&lt;br /&gt;&lt;script class="brush:xml" type="syntaxhighlighter"&gt;&lt;![CDATA[&lt;!-- classpath used for test code generation --&gt;  &lt;path id="hive.ant.classpath"&gt;    &lt;fileset dir="${hive.root}/lib"&gt;      &lt;include name="hive-anttasks-*.jar"/&gt;      &lt;include name="velocity-*.jar"/&gt;      &lt;include name="commons-lang-*.jar"/&gt;      &lt;include name="commons-collections-*.jar"/&gt;    &lt;/fileset&gt;  &lt;/path&gt;&lt;!-- generate test cases --&gt;  &lt;target name="hive.gen.test" depends="hive.test.conditions, hive.test.init" &gt;  &lt;taskdef name="qtestgen" classname="org.apache.hadoop.hive.ant.QTestGenTask"             classpathref="hive.ant.classpath"/&gt;    &lt;qtestgen outputDirectory="${target.gen.java.test.dir}/org/apache/hadoop/hive/cli"               templatePath="${hive.test.template.dir}"              template="TestCliDriver.vm"               queryDirectory="${target.hive.positive.query.dir}"               queryFile="${qfile}"              queryFileRegex="${qfile_regex}"              clusterMode="${clustermode}"              resultsDirectory="${hive.positive.results.dir}"              className="TestCliDriver"              logFile="${hive.test.log.dir}/testclidrivergen.log"              logDirectory="${hive.test.log.dir}"              hadoopVersion="${hadoop.version}"   
 /&gt;    &lt;qtestgen outputDirectory="${target.gen.java.test.dir}/org/apache/hadoop/hive/cli"               templatePath="${hive.test.template.dir}"              template="TestNegativeCliDriver.vm"               queryDirectory="${target.hive.negative.query.dir}"               queryFile="${qfile}"              queryFileRegex="${qfile_regex}"              clusterMode="${clustermode}"              resultsDirectory="${hive.negative.results.dir}"              className="TestNegativeCliDriver"              logFile="${hive.test.log.dir}/testnegativeclidrivergen.log"              logDirectory="${hive.test.log.dir}"              hadoopVersion="${hadoop.version}"    /&gt;  &lt;/target&gt;]]&gt;&lt;/script&gt;&lt;br /&gt;&lt;br /&gt;Here are some notes about the key variables above:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;hive.test.template.dir&lt;/span&gt; - the directory where the Velocity templates are located for the code generation step.&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;target.hive.positive.query.dir&lt;/span&gt; - the directory where positive test cases are located.&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;target.hive.negative.query.dir&lt;/span&gt; - the directory where negative test cases are located.&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;hive.positive.results.dir&lt;/span&gt; - the directory where expected positive test results are located.  Each results file must be named after its query file with ‘.out’ appended.  
For example, if the test query file is named hive_test.q then the results file must be named hive_test.q.out.&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;hive.negative.results.dir&lt;/span&gt; - the directory where expected negative test results are located.&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;qfile&lt;/span&gt; - This variable should be specified if you want to generate a test class with a single test case.  For example, if you have a test file named hive_test.q, then you would set the value of this property to hive_test (e.g. ant -Dqfile=hive_test hive.gen.test).&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;qfile_regex&lt;/span&gt; - Similar in functionality to qfile, this variable should be set to a regular expression that will match the test files that you want to generate tests for.&lt;/li&gt;&lt;/ul&gt;The test classes are generated from velocity template files.  
You can find examples of the templates in the Hive codebase here: &lt;a href="https://github.com/apache/hive/blob/trunk/ql/src/test/templates/TestCliDriver.vm"&gt;https://github.com/apache/hive/blob/trunk/ql/src/test/templates/TestCliDriver.vm&lt;/a&gt;&lt;br /&gt;&lt;a href="https://github.com/apache/hive/blob/trunk/ql/src/test/templates/TestNegativeCliDriver.vm"&gt;https://github.com/apache/hive/blob/trunk/ql/src/test/templates/TestNegativeCliDriver.vm&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The above files can basically be used as-is, but you will need to provide your own test helper class, QTestUtil, and update its package location accordingly in the templates.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;QTestUtil&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;QTestUtil contains code for:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;starting up hive&lt;/li&gt;&lt;li&gt;executing a query file&lt;/li&gt;&lt;li&gt;comparing the results to expected results&lt;/li&gt;&lt;li&gt;running cleanup between tests&lt;/li&gt;&lt;li&gt;shutting down hive&lt;/li&gt;&lt;/ul&gt;You can find the one from the Hive project here: &lt;a href="https://github.com/apache/hive/blob/trunk/ql/src/test/org/apache/hadoop/hive/ql/QTestUtil.java"&gt;https://github.com/apache/hive/blob/trunk/ql/src/test/org/apache/hadoop/hive/ql/QTestUtil.java&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The main modifications you will want to make to this file are deletions, as there is some Hive-project-specific setup code that you will not need in your environment.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Executing the tests&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;After you have generated the tests, you can execute them by creating a target with the junit task.  
Here is some sample Ant for doing this:&lt;br /&gt;&lt;br /&gt;&lt;script class="brush:xml" type="syntaxhighlighter"&gt;&lt;![CDATA[&lt;target name="hive-tests.main"&gt;    &lt;property name="hive.test.classpath.id" value="hive.test.classpath"/&gt;    &lt;junit showoutput="${test.output}" printsummary="yes" haltonfailure="no"           fork="yes" maxmemory="512m" dir="${basedir}" timeout="${test.timeout}"           errorProperty="tests.failed" failureProperty="tests.failed" filtertrace="off"&gt;      &lt;env key="HADOOP_HOME" value="${hadoop.root}"/&gt;      &lt;env key="TZ" value="US/Pacific"/&gt;      &lt;sysproperty key="test.output.overwrite" value="${overwrite}"/&gt;      &lt;sysproperty key="test.service.standalone.server" value="${standalone}"/&gt;      &lt;sysproperty key="log4j.configuration" value="file://${target.hive.data.dir}/conf/hive-log4j.properties"/&gt;      &lt;sysproperty key="derby.stream.error.file" value="${target.hive.dir}/derby.log"/&gt;      &lt;sysproperty key="ql.test.query.clientpositive.dir" value="${target.hive.positive.query.dir}"/&gt;      &lt;sysproperty key="ql.test.results.clientpositive.dir" value="${hive.positive.results.dir}"/&gt;      &lt;sysproperty key="test.log.dir" value="${hive.test.log.dir}"/&gt;      &lt;sysproperty key="hadoop.log.dir" value="${hive.test.log.dir}"/&gt;      &lt;sysproperty key="test.silent" value="${test.silent}"/&gt;      &lt;sysproperty key="test.tmp.dir" value="${target.hive.dir}/tmp"/&gt;      &lt;sysproperty key="test.warehouse.dir" value="${target.hive.dir}/test/data/warehouse"/&gt;      &lt;sysproperty key="target.hive.dir" value="${target.hive.dir}"/&gt;      &lt;sysproperty key="build.dir" value="${target.hive.dir}"/&gt;      &lt;sysproperty key="build.dir.hive" value="${target.hive.dir}"/&gt;      &lt;classpath refid="${hive.test.classpath.id}"/&gt;      &lt;formatter type="${test.junit.output.format}" usefile="${test.junit.output.usefile}" /&gt;      &lt;batchtest 
todir="${target.testresults.dir}" unless="testcase"&gt;        &lt;fileset dir="${hive.test.build.classes}"                 includes="**/${hive.test.include}.class" /&gt;      &lt;/batchtest&gt;      &lt;batchtest todir="${target.testresults.dir}" if="testcase"&gt;        &lt;fileset dir="${hive.test.build.classes}" includes="**/${testcase}.class"/&gt;      &lt;/batchtest&gt;      &lt;assertions&gt;        &lt;enable /&gt;      &lt;/assertions&gt;    &lt;/junit&gt;    &lt;fail if="tests.failed"&gt;Tests failed!&lt;/fail&gt;  &lt;/target&gt;]]&gt;&lt;/script&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;This post outlined a solution for unit testing Hive scripts.  Another nice aspect of this approach that I failed to mention is that it’s based on JUnit so you can use your existing code coverage tools with it (we use Cobertura) to get coverage information when testing custom UDFs.  Also, I should mention that I used Hive 0.6.0 when putting this together.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-7793631849323213500?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/7793631849323213500/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=7793631849323213500' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7793631849323213500'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7793631849323213500'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2011/04/hive-unit-testing.html' title='Hive Unit 
Testing'/><author><name>Timo</name><uri>http://www.blogger.com/profile/05949421779840031276</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-6667155456780605543</id><published>2011-04-06T09:34:00.000-07:00</published><updated>2011-04-06T09:46:20.423-07:00</updated><title type='text'>Hive 0.7 no longer auto-downloads transform scripts</title><content type='html'>I ran into a bit of a surprise moving a Hive 0.5 script to Hive 0.7 the other day.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Previously, in Hive 0.5, we called our Java transform code like:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;insert overwrite table the_table&lt;/div&gt;&lt;div&gt;select&lt;/div&gt;&lt;div&gt;  transform(...)&lt;/div&gt;&lt;div&gt;  using 'java -cp s3://bucket-name/code.jar MapperClassName'&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;Behind the scenes, before actually calling the "java" executable, Hive would inspect each of the arguments and, if it found an "s3://..." URL, download that file from S3 to a local copy, and then pass the path to the local copy to your program.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This was convenient as then your external "java" executable didn't have to know anything about S3, how to authenticate with it, etc.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;However, in Hive 0.7, this no longer works. Perhaps for the understandable reason that if you did want to pass the literal string "s3://..." 
to your mapper class, Hive implicitly interjecting on your behalf may not be what you want, and, AFAIK, you had no way to avoid it.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So, now an explicit "add file" command is required, e.g.:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;add file s3://bucket-name/code.jar&lt;/div&gt;&lt;div&gt;&lt;div&gt;insert overwrite table the_table&lt;/div&gt;&lt;div&gt;select&lt;/div&gt;&lt;div&gt;  transform(...)&lt;/div&gt;&lt;div&gt;  using 'java -cp code.jar MapperClassName'&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The add file command downloads code.jar to the local execution directory (without any bucket name/path mangling like in Hive 0.5), and then your transform arguments can reference the local file directly.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;All in all, a pretty easy fix, but rather frustrating to figure out given the long cycle time of EMR jobs.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Also, kudos to this post in the AWS developer forums that describes the same problem and solution:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="https://forums.aws.amazon.com/thread.jspa?messageID=225472"&gt;https://forums.aws.amazon.com/thread.jspa?messageID=225472&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-6667155456780605543?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/6667155456780605543/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=6667155456780605543' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/5261056907132640554/posts/default/6667155456780605543'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/6667155456780605543'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2011/04/hive-07-no-longer-auto-downloads.html' title='Hive 0.7 no longer auto-downloads transform scripts'/><author><name>Stephen Haberman</name><uri>http://www.blogger.com/profile/05412274950722949930</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-4133040520417862859</id><published>2011-03-11T11:14:00.000-08:00</published><updated>2011-03-11T11:16:16.700-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='culture'/><category scheme='http://www.blogger.com/atom/ns#' term='engineering'/><title type='text'>On Building a Kick Ass Engineering Team -- Part 1</title><content type='html'>&lt;div&gt;&lt;br /&gt;&lt;/div&gt;We started Bizo just about three years ago with the goal of building a great business and a world-class engineering team.  One of the first things I did was write down what I thought it would take to create a kick-ass engineering team.  
&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I figured it would be good to share this list with others and discuss why I thought the items would help build a great engineering team.&lt;div&gt;&lt;br /&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;Here is what I wrote:&lt;br /&gt;&lt;br /&gt;&lt;iframe src="https://docs.google.com/document/pub?id=11SeKOWkoWyKQbd2GvoYFfhcvKS24h8ixDuiRtiwVTGU&amp;amp;embedded=true&amp;amp;hgd=1" width="450px" height="520px"&gt;&lt;/iframe&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;Let's review these in a bit more detail...&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;Commitment to Discipline&lt;/span&gt;&lt;span class="Apple-style-span"&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;"Commitment to discipline" is perhaps the most important characteristic for any super productive team.  Software development is full of moments that try to pull you away from the task at hand or the problem that you &lt;i&gt;should be&lt;/i&gt; trying to solve.  Those weak moments where you get pulled into refactoring some sub-system end up being a huge time sink.  Discipline means being able to recognize that, while it would be great to refactor &lt;i&gt;foo&lt;/i&gt;, doing so is not actually a requirement of "getting shit done" and can wait.  &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;Must be cultural&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;I believe that the culture of any startup comes from the founders and early employees and, more generally, the culture of any business comes from the leaders.  
Therefore, it is absolutely imperative for these leaders to define what they want the culture to be and furthermore act on that definition.  Culture is self-reinforcing: actions create culture which creates actions which strengthen culture.   Just like a habit, once the culture is formed it is difficult to change, so it pays to be purposeful when creating your company's culture.  You will have to live with it, good or bad.  &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;The 3 Cs&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;Communication, Communication, Communication!  (I stole this directly from my high school baseball coach Mr. Barden.) A major part of engineering success comes down to communication.  What should we build, how should we build it, how can I plug into your system, how does this code work, etc.  We spend an enormous amount of time communicating and it shows.  We ensure communication through design reviews for all features/projects, and code reviews for every single line of code that makes it into production.  Code reviews are an amazing place to learn, ensure quality and share knowledge.  We even spend time communicating about non-Bizo related stuff that we find interesting through "Lunch and Learns".  &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;I'll save how we go about managing all this communication for another post, but the result of our commitment to design reviews, code reviews, and the like is a team of engineers who are eager to get feedback, humble about their skills, and strive to provide objective feedback to others.  
Objectivity is an extremely valuable characteristic of a great engineer and something that we look for in hiring and something that we strive to develop and promote.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;Testing&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;Testing is a huge part of what we believe in here at Bizo. Beyond the obvious benefit of ensuring that your code works, testing has some great side effects.  For one, it is an investment in future development, making it easier to develop features on top of an existing product base and making it easier to "pivot" (I used the word "turn" 3 years ago before pivot was in wide use).  Being a big believer in having some down time, I also consider testing an investment in one's weekend because things always seem to go wrong when you are not at work.  :)  Finally, extensive testing ensures a low amount of code debt, which should be any dev team's goal.  You always have to pay it off and upfront payments are the cheapest...  &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;Visibility&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;This should really be part of the 3Cs and I would expand on this even more today.  Not only is it important for code metrics, build status and task management to be communicated widely, it is also extremely important for the efforts of engineers to be communicated.  As a company, we have a business standup and an engineering standup where group managers can engage with the teams and get visibility into what is being done.  
We are also experimenting with bringing engineering earlier into the product development process at the scoping level.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;Ownership&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;Ownership is key to getting the most out of any team, including engineering.  I think productivity and motivation are a function of happiness and a lot of happiness comes from being trusted to do your job.  At a certain point this comes back to hiring.  If you can hire people who are great cultural fits and great engineering fits, then you must trust them to do their jobs.  That trust will make them better employees.  &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;In Summary&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;To wrap this all up, I think we've done well staying true to what I outlined three years ago and I am extremely happy with the results.  I've been around a lot of engineering teams and I've never been happier with (or prouder of) what we've been able to accomplish both with the products and code we've shipped and with the culture of engineering we've built.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;To Be Continued&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;In a follow-up post I'll discuss what (if anything) I would add or remove from this list if I had to start over today and what I think we should focus on as an engineering team for the next few years.  
&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-4133040520417862859?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/4133040520417862859/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=4133040520417862859' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/4133040520417862859'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/4133040520417862859'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2011/03/on-building-kick-ass-engineering-team.html' title='On Building a Kick Ass Engineering Team -- Part 1'/><author><name>Donnie</name><uri>http://www.blogger.com/profile/13599133732419522440</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-3237206929366972951</id><published>2011-02-24T17:34:00.001-08:00</published><updated>2011-02-24T17:46:36.140-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hive'/><title type='text'>"dynamic" columns in Hive</title><content type='html'>One of the presentations at the 
&lt;a href="http://www.meetup.com/hbaseusergroup/events/16492913/"&gt;HBase meetup&lt;/a&gt; the other night was on building a query language on top of HBase.  No less than 3 people asked "Why not use Hive?".  The main reason given was that Hive is too slow for doing simple selects.  But, the other thing they really liked about using HBase was that your columns were dynamic -- it's easy to add new fields to your data.&lt;br /&gt;&lt;br /&gt;Most of the data we log is in a simple log file format:&lt;ul&gt;&lt;li&gt;One record per line, separated by newline.&lt;/li&gt;&lt;li&gt;Each record can have one or more fields.  Fields are separated by ^A (\001).&lt;/li&gt;&lt;li&gt;Each field is a key/value pair separated by ^B (\002).  Field order is not specified.&lt;/li&gt;&lt;/ul&gt;In practice this basically looks like:&lt;pre&gt;&lt;br /&gt;ts=1298598378404/code=403/message=bad referrer: bizo.com/...&lt;br /&gt;&lt;/pre&gt;(Where / is ^A and = is ^B).&lt;p&gt;This is a format we chose pretty early on, way before we ever looked at Hive.  It turns out to be a great format:&lt;ul&gt;&lt;li&gt;Human readable.&lt;/li&gt;&lt;li&gt;Trivial to parse in any language.&lt;/li&gt;&lt;li&gt;Dynamic -- easy to add/remove fields from your data.&lt;/li&gt;&lt;/ul&gt;It also turns out that it works really well with Hive.  Our typical Hive table looks something like:&lt;pre class="prettyprint"&gt;&lt;br /&gt;  create external table api_logs(d map&amp;lt;string,string&amp;gt;)&lt;br /&gt;  partitioned by (...)&lt;br /&gt;  row format delimited&lt;br /&gt;    fields terminated by '\004'&lt;br /&gt;    collection items terminated by '\001'&lt;br /&gt;    map keys terminated by '\002'&lt;br /&gt;  stored as textfile&lt;br /&gt;;&lt;br /&gt;&lt;/pre&gt;That is, each row is just a single column, which is a map.  At first this seemed a little degenerate to me, but it actually models our data perfectly.  
There are no guarantees about which fields are available, and it's easy to add/remove fields in the data over time.  I should mention that this is really just for our report input, typically our report output will be in a fixed format.&lt;br /&gt;&lt;br /&gt;If you're using Hive 0.6 or greater, with Hive &lt;a href="http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL#Create.2BAC8-Drop_View"&gt;View support&lt;/a&gt; it's also easy to get the best of both worlds.&lt;pre class="prettyprint"&gt;&lt;br /&gt;  create view api_errors(ts, code, message) as&lt;br /&gt;  select d["ts"], d["code"], d["message"]&lt;br /&gt;    from api_logs&lt;br /&gt;   where d["code"] &gt;= 400&lt;br /&gt;;&lt;br /&gt;&lt;/pre&gt;You can even change the type information or transform the data by including a cast or a UDF as part of your view.  Creating a view doesn't cause anything to run, or create any additional storage.  Its query conditions are basically just merged with subsequent queries on that view.&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-3237206929366972951?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/3237206929366972951/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=3237206929366972951' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/3237206929366972951'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/3237206929366972951'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2011/02/columns-in-hive.html' title='&amp;quot;dynamic&amp;quot; columns in Hive'/><author><name>larry 
ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-751122382075416949</id><published>2011-01-27T17:09:00.000-08:00</published><updated>2011-01-27T17:35:32.144-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='GWT'/><title type='text'>Adventures In GWT-land Part #1: Awkward Baby Steps</title><content type='html'>&lt;p&gt;Over the past several months we’ve been working on a (super) secret shiny new GWT application (it’s basically going to rock your socks off). This was my first GWT application and coming from a non-Java, non-GWT background where I was used to writing raw Javascript pretty often - it’s been interesting to say the least. What follows is the first in a multi-part series where I’d like to reflect on life in GWT-land and hopefully provide a few cool tips and code samples along the way.&lt;/p&gt;&lt;h3&gt;First Steps:&lt;/h3&gt;&lt;p&gt;Stepping into GWT for the first time when you’re used to pure Javascript - or “normal” CSS/HTML front-end development in general - is pretty awkward. The biggest adjustment is that all client-side logic is defined in Java classes which later get compiled into several permutations of obfuscated Javascript. GWT also tries to be helpful and obfuscates all of your CSS classes in an attempt to prevent namespace collisions (more on this in a future post). If you’re thinking that FireBug becomes much less useful when working with GWT you’re absolutely right...but thanks to some really good debugging tools in GWT, this isn’t a terrible loss and you really won’t need it.&lt;/p&gt;&lt;h3&gt;What’s with this Java -&gt; Javascript stuff?&lt;/h3&gt;&lt;p&gt;Why would somebody want to write a compiler for Java  -&gt; Javascript? 
One of the primary arguments for doing this is type-safety...which is fair enough I suppose - compile time checking is nice to have. You also get the benefit of native Java debugging tools, which, despite the huge advances in client-side debugging in the past few years, are still superior (mostly because of static typing). It’s really nice to be able to step through line by line, set breakpoints and inspect typed objects. However, despite these benefits, writing Java code that compiles into Javascript still feels weird.&lt;/p&gt;&lt;p&gt;The biggest reason for this awkwardness is that normally programmers write code in a language that is more expressive than their compile target, e.g. C/C++ compiles to assembly, Java to the JVM’s bytecode, CoffeeScript to Javascript etc. But with GWT your compile target is actually more expressive than the code you write and it’s not just a little bit more expressive, it’s a lot more expressive - which is a strange feeling indeed. Javascript has lambdas, prototypal inheritance, a less verbose syntax and a dynamic type system - all of which lead to a more expressive language, allowing you to do more with less code. Java on the other hand has none of these - expressing things that are normally trivial in Javascript (like a custom event system, currying etc.) can become a major chore (sometimes 4-5 classes or more) in GWT. Actually the whole process is rather akin to writing C/C++ code that compiles into Ruby or Python (ok, perhaps a slight exaggeration...but really only by a bit). But fear not, below are some tips to help make life easier for you in GWT-land.&lt;/p&gt;&lt;h3&gt;Adjusting to life in GWT-land&lt;/h3&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;First use the gwt-mpv framework (&lt;a href="http://www.gwtmpv.org/"&gt;http://www.gwtmpv.org/&lt;/a&gt;) - it will save your sanity and quite possibly your soul. 
One of our awesome developers, Stephen, has created a very nice model, view, presenter framework on top of GWT that removes a huge amount of boilerplate. It has some really nice stuff like validation, two way data binding, code generation for tedious boilerplate and more. Your fingers and brain will thank you for saving them from the verbosity of vanilla GWT, trust me...I’m an engineer.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Abandon the notion of separation of content (HTML), presentation (CSS), and behavior (Javascript) found in traditional front-end development. GWT, being a framework, is opinionated and uses widgets instead. Widgets are generally responsible for all three of these things at once - they know their own CSS classes, keep their own data model and have event handlers. The idea behind this is you can just drop a widget onto any page and have it “just work” with no external dependencies. The disadvantage is that the approach is inflexible if you want to change any of those three things while holding the others constant. In practice Widgets work quite well until you need to change or extend one that lives in somebody else’s jar - then they can quickly turn into a pain. So choose your imported widgets carefully or be prepared to fork things (or mash ctrl-c ctrl-v  a lot).&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Even though you’re writing in Java - you’re still working on the client side. Custom events are great for decoupling and facilitating easy testing of your objects as you only have to talk to a single event handler instead of multiple objects. Don’t go overboard though, you still want to avoid having events that call events that call events etc. 
as you’ll end up jumping all over the place trying to find out what really was supposed to happen when the original event fired.&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-751122382075416949?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/751122382075416949/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=751122382075416949' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/751122382075416949'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/751122382075416949'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2011/01/adventures-in-gwt-land-part-1-awkward.html' title='Adventures In GWT-land Part #1: Awkward Baby Steps'/><author><name>Josh Carver</name><uri>http://www.blogger.com/profile/15167764329841650102</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-8241149693920151899</id><published>2011-01-26T16:55:00.001-08:00</published><updated>2011-01-26T17:10:58.023-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hive'/><category scheme='http://www.blogger.com/atom/ns#' term='emr'/><category scheme='http://www.blogger.com/atom/ns#' term='aws'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>EMR/Hive: recovering a large number of partitions</title><content type='html'>If you try to run "&lt;tt&gt;alter table ... 
recover partitions&lt;/tt&gt;" on a table with a large number of partitions, you may run into this error:&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;&lt;br /&gt;FAILED: Error in metadata: org.jets3t.service.S3ServiceException: Failed to sanitize XML document destined for handler class org.jets3t.service.impl.rest.XmlResponsesSaxParser$ListBucketHandler null 'null' -- ResponseCode: -1, ResponseStatus: null, RequestId: null, HostId: null&lt;br /&gt;FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask&lt;br /&gt;&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;There's some &lt;a href="https://forums.aws.amazon.com/thread.jspa?messageID=216835"&gt;discussion&lt;/a&gt; in the aws forums.  The underlying cause is that it's running out of memory when trying to build the partition list.&lt;br /&gt;&lt;br /&gt;A workaround is to increase the &lt;tt&gt;HADOOP_HEAPSIZE&lt;/tt&gt;.  This can be done by modifying &lt;a href="http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/index.html?UsingEMR_Config.html#UsingEMR_Config_hadoop-user-env.sh"&gt;hadoop-user-env.sh&lt;/a&gt; with an &lt;a href="http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/index.html?Bootstrap.html"&gt;EMR bootstrap action&lt;/a&gt;.  On an m1.large instance, 2G seems to do the trick for us.&lt;br /&gt;&lt;br /&gt;Upload a script like the following somewhere in s3:&lt;br /&gt;&lt;br /&gt;&lt;script src="https://gist.github.com/797836.js?file=set-hadoop-heap.sh"&gt;&lt;/script&gt;&lt;br /&gt;&lt;br /&gt;You can now run this bootstrap action as part of your job:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;elastic-mapreduce --create --alive \&lt;br /&gt;      --name "large partitions..." 
--hive-interactive \&lt;br /&gt;      --num-instances 1 --instance-type m1.large \&lt;br /&gt;      --hadoop-version 0.20 \&lt;br /&gt;      --bootstrap-action s3://&amp;lt;bucket/path&amp;gt;/set-hadoop-heap.sh&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;You should now be able to load your partitions.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-8241149693920151899?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/8241149693920151899/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=8241149693920151899' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/8241149693920151899'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/8241149693920151899'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2011/01/emrhive-recovering-large-number-of.html' title='EMR/Hive: recovering a large number of partitions'/><author><name>larry ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-944017010746898608</id><published>2010-11-17T09:25:00.000-08:00</published><updated>2010-11-17T09:51:58.459-08:00</updated><title type='text'>Spring NamespaceHandler debugging</title><content type='html'>While updating a library that used Spring yesterday, I began suffering the dreaded "Unable to locate Spring NamespaceHandler for XML schema namespace" exception.&lt;br 
/&gt;&lt;br /&gt;The net is littered with threads about this exception, all of which end with "put the spring jars in your WEB-INF/lib directory".&lt;br /&gt;&lt;br /&gt;I was fairly determined not to do this, as I frequently start a webapp with an embedded instance of Jetty, and use the Eclipse project's classpath for all of the dependencies. Since all of the jars are from the project's classpath, the webapp is always using exactly the same jars that Eclipse is for compiling your code, so you never have the two drift apart.&lt;br /&gt;&lt;br /&gt;Too many times, after copying jars to WEB-INF/lib and forgetting about them, I'll upgrade a library, everything compiles fine, but spend an embarrassing amount of time wondering why it's not working in the webapp, before remembering the stale jar in the WEB-INF/lib directory.&lt;br /&gt;&lt;br /&gt;Anyway, the real cause of the NamespaceHandler exception in my case was a buggy ClassLoader.getResources implementation.&lt;br /&gt;&lt;br /&gt;The way Spring works, when doing its XML parsing/whatever magic, is that when it sees "xmlns:tx=...", it wants a new NamespaceHandler that knows how to handle those tags.&lt;br /&gt;&lt;br /&gt;To allow extensible NamespaceHandlers, Spring uses ClassLoader.getResources("META-INF/spring.handlers") to get a list of all of the spring.handlers files across all of the jar files in the classloader. So if spring-core, spring-tx, spring-etc. all have spring.handlers with NamespaceHandlers in them, each file gets found and loaded.&lt;br /&gt;&lt;br /&gt;Here's the rub: whatever Eclipse project classloader that had spring-tx on it only returned spring.handlers files from jars &lt;b&gt;that already had classes loaded from them&lt;/b&gt;. 
The ClassLoader.getResources implementation would not look into jar files that were on its classpath, but had not yet been opened for loading classes.&lt;br /&gt;&lt;br /&gt;Of all things, adding:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" &gt;   Class.forName("org.springframework.transaction.support.TransactionTemplate");&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;before Spring was initialized fixed the NamespaceHandler error. Everything boots up correctly now.&lt;br /&gt;&lt;br /&gt;While it took way too long to figure out, I'm pleased that I can continue using the Eclipse project classloader for the webapp I'm starting and avoid the annoying "copy jars to WEB-INF/lib" solution.&lt;br /&gt;&lt;br /&gt;I'd like to know which classloader had the buggy getResources implementation, but I've already spent too much time on this so far.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-944017010746898608?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/944017010746898608/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=944017010746898608' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/944017010746898608'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/944017010746898608'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2010/11/spring-namespacehandler-debugging.html' title='Spring NamespaceHandler debugging'/><author><name>Stephen Haberman</name><uri>http://www.blogger.com/profile/05412274950722949930</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-6118701443769789654</id><published>2010-11-12T17:09:00.001-08:00</published><updated>2011-02-28T10:04:43.604-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hive'/><title type='text'>CSV and Hive</title><content type='html'>&lt;h1&gt;CSV&lt;/h1&gt;Anyone who's ever dealt with CSV files knows how much of a pain the format actually is to parse.  It's not as simple as splitting on commas -- the fields might have commas embedded in them, so, okay you put quotes around the field... but what if the field had quotes in it?  Then you double up the quotes... "okay, ""great""" -- that was a single CSV field.&lt;br /&gt;&lt;br /&gt;We normally use the excellent &lt;a href="http://opencsv.sourceforge.net/"&gt;opencsv&lt;/a&gt;  (apache2 licensed) library to deal with CSV files.&lt;br /&gt;&lt;br /&gt;&lt;h1&gt;Hive&lt;/h1&gt;We love &lt;a href="http://hive.apache.org/"&gt;Hive&lt;/a&gt;.  Almost all of our reporting is written as Hive scripts.  How do you deal with CSV files with Hive?  If you know for sure your fields don't have any commas in them, you can get away with the delimited format.  There's the RegexSerDe, but as mentioned the format is non-trivial, and you need to change the regex string depending on how many columns you are expecting.&lt;br /&gt;&lt;br /&gt;&lt;h1&gt;CSVSerde&lt;/h1&gt;Enter the &lt;a href="https://github.com/ogrodnek/csv-serde"&gt;CSVSerde&lt;/a&gt;.  
It's a Hive &lt;a href="http://wiki.apache.org/hadoop/Hive/SerDe"&gt;SerDe&lt;/a&gt; that uses the opencsv parser to serialize and deserialize tables properly in the CSV format.&lt;br /&gt;&lt;br /&gt;Using it is pretty simple:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;&lt;br /&gt;add jar path/to/csv-serde.jar;&lt;br /&gt;&lt;br /&gt;create table my_table(a string, b string, ...)&lt;br /&gt;  row format serde 'com.bizo.hive.serde.csv.CSVSerde'&lt;br /&gt;  stored as textfile&lt;br /&gt;;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This is my first time writing a Hive SerDe.  There were a couple of road bumps, but overall I was surprised with how easy it was.  I mostly just followed along with &lt;a href="https://github.com/apache/hive/blob/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java"&gt;RegexSerDe&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I'm sure there are a lot of ways it could be improved, so I'd appreciate any feedback or comments on how to make it better.&lt;br /&gt;&lt;br /&gt;&lt;a href="https://github.com/ogrodnek/csv-serde/blob/master/src/main/java/com/bizo/hive/serde/csv/CSVSerde.java"&gt;Source&lt;/a&gt;.&lt;br /&gt;&lt;a href="https://github.com/downloads/ogrodnek/csv-serde/csv-serde-1.0.jar"&gt;Binary&lt;/a&gt; (jar packaged with opencsv).&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-6118701443769789654?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/6118701443769789654/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=6118701443769789654' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/6118701443769789654'/><link rel='self' 
type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/6118701443769789654'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2010/11/csv-and-hive.html' title='CSV and Hive'/><author><name>larry ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-5478609748349893447</id><published>2010-10-26T17:14:00.000-07:00</published><updated>2010-10-26T17:43:54.109-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gslb'/><category scheme='http://www.blogger.com/atom/ns#' term='aws'/><category scheme='http://www.blogger.com/atom/ns#' term='ops'/><title type='text'>Rolling out to 4 Global Regional Datacenters in 25 minutes</title><content type='html'>Sometimes I just have to sit back and reflect on the amazing operational power available on &lt;a href="http://aws.amazon.com/"&gt;AWS&lt;/a&gt;.  As you know, we are &lt;a href="http://aws.amazon.com/about-aws/whats-new/2009/12/09/announcing-aws-start-up-challenge-winner/"&gt;hardcore AWS-ers&lt;/a&gt; here at Bizo and we've been &lt;a href="http://dev.bizo.com/2010/05/improving-global-application.html"&gt;running in all 4 regions for several months&lt;/a&gt;.  Recently we needed to roll out a new service which we wanted to be Globally Load Balanced (&lt;a href="http://www.google.com/search?&amp;amp;q=define:gslb"&gt;GSLB&lt;/a&gt;) and the rollout was astoundingly quick and easy.  
&lt;b&gt;The total time it took us to go from 0 to 4 regions was 25 minutes!!!&lt;/b&gt;  Amazing!&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Only 25 minutes to set up a service that runs in 4 regions and 8 datacenters and autoscales to handle pretty much any amount of load we send it!&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;Shout out to AWS and &lt;a href="http://dyn.com/enterprise-dynect-platform"&gt;Dynect&lt;/a&gt; for making it almost too easy...&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-5478609748349893447?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/5478609748349893447/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=5478609748349893447' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/5478609748349893447'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/5478609748349893447'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2010/10/rolling-out-to-4-global-regional.html' title='Rolling out to 4 Global Regional Datacenters in 25 minutes'/><author><name>Donnie</name><uri>http://www.blogger.com/profile/13599133732419522440</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-7030804341627213505</id><published>2010-10-21T13:11:00.000-07:00</published><updated>2010-10-21T14:01:13.738-07:00</updated><title type='text'>An experiment in file distribution from S3 to EC2 via bittorrent</title><content type='html'>Amazon's &lt;a href="http://aws.amazon.com/autoscaling/"&gt;autoscaling&lt;/a&gt; service is fantastic.  It allows you to dynamically scale the number of instances running your application based on a variety of triggers, including CPU usage, request latency, I/O usage, and more.  Thus, you can increase your capacity in response to increased demand for your services.&lt;br /&gt;&lt;br /&gt;One difficulty with this approach is that your response time is strictly bounded by the time it takes for you to spin up a new instance with your application running on it.  This isn't a big deal for most servers, but some of our backend systems need multi-GB databases and indexes loaded onto them at startup.&lt;br /&gt;&lt;br /&gt;There are several strategies for working around this, including baking the indexes into the AMI and distributing them via EBS volume; however, I was intrigued by the possibility of using &lt;a href="http://docs.amazonwebservices.com/AmazonS3/index.html?S3Torrent.html"&gt;S3's bittorrent support&lt;/a&gt; to enable peer-to-peer downloads of data.  In an autoscaling situation, there are presumably several instances with the necessary data already running, and using bittorrent should allow us to quickly copy that file to a new instance.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Test setup:&lt;/h3&gt;&lt;br /&gt;&lt;br /&gt;All instances were m1.smalls running Ubuntu Lucid in us-east-1, spread across two availability zones.  
The test file was a 1GB partition of a larger zip file.&lt;br /&gt;&lt;br /&gt;For a client, I used the version of &lt;a href="http://www.bittornado.com/"&gt;Bittornado&lt;/a&gt; available in the standard repository (apt-get install -y bittornado).  Download and upload speeds were simply read off of the curses interface.&lt;br /&gt;&lt;br /&gt;For reference, I clocked a straight download of this file directly from S3 as taking an average of 57 seconds, which translates into almost 18 MB/s.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Test results:&lt;/h3&gt;&lt;br /&gt;&lt;br /&gt;First, I launched a single instance and started downloading from S3.  S3 only gave me 70-75KB/s, considerably less than direct S3 downloads.&lt;br /&gt;&lt;br /&gt;As the first was still downloading, I launched a second instance.  The second instance quickly caught up to the first, then the download rate on each instance dropped to 140-150KB/s with upload rates at half that.  Clearly, what was going on was S3 was giving each instance 70-75KB/s of bandwidth, and the peers were cooperating by sharing their downloaded fragments.&lt;br /&gt;&lt;br /&gt;To verify this behavior, I then launched two more instances and hooked them into the swarm.  Again, the new peers quickly caught up to the existing instances, and download rates settled down to 280-300KB/s on each of the four instances.&lt;br /&gt;&lt;br /&gt;So, there's clearly some serious throttling going on when downloading from S3 via bittorrent.  However, the point of this experiment is not the S3 -&gt; EC2 download speed but the EC2 &lt;-&gt; EC2 file sharing speed.&lt;br /&gt;&lt;br /&gt;Once all four of these instances were seeding, I added a fifth instance to the swarm.  Download rates on this instance maxed out at around 12-13 MB/s.  
Once this instance was seeding, I added a sixth instance to the swarm to see if bandwidth would continue to scale up, but I didn't see an appreciable difference.&lt;br /&gt;&lt;br /&gt;So, it looks like using bittorrent within EC2 is actually only about two-thirds as fast as downloading directly from S3.  In particular, even with a better-tuned environment (e.g., moving to larger instances to eliminate sharing physical bandwidth with other instances), it doesn't look like we would get any significant decrease in download times by using bittorrent.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-7030804341627213505?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/7030804341627213505/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=7030804341627213505' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7030804341627213505'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7030804341627213505'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2010/10/experiment-in-file-distribution-from-s3.html' title='An experiment in file distribution from S3 to EC2 via bittorrent'/><author><name>Darren</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-1746179975376746655</id><published>2010-10-01T14:31:00.000-07:00</published><updated>2010-10-01T15:14:10.926-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' 
term='java kill process'/><title type='text'>Killing java processes</title><content type='html'>I often want to kill java processes, be it an unresponsive Eclipse, a blown-out jEdit after I try to open a 2GB file, a stalled JUnit test suite, a borked scalac compiler daemon or a random Tomcat instance.&lt;br /&gt;&lt;br /&gt;It gets tiring to write,&lt;pre class="prettyprint"&gt;$ jps -lv&lt;br /&gt;48231 /opt/eclipse-3.5.1/org.eclipse.equinox.launcher_1.0.201.jar -Xmx1024m&lt;br /&gt;10258 /opt/boisvert/jedit-4.3.2/jedit.jar -Xmx192M&lt;br /&gt;5295 sun.tools.jps.Jps -Dapplication.home=/opt/boisvert/jdk1.6.0_21 -Xms8m&lt;br /&gt;&lt;/pre&gt; followed by,&lt;pre class="prettyprint"&gt;$ kill 48231&lt;/pre&gt;You know, with the cut &amp;amp; paste in-between ... so I have this Ruby shell script called &lt;code&gt;killjava&lt;/code&gt;, a close cousin of &lt;code&gt;killall&lt;/code&gt;:&lt;br /&gt;&lt;pre class="prettyprint"&gt;$ killjava -h&lt;br /&gt;killjava [-9] [-n] [java_main_class]&lt;br /&gt;&lt;br /&gt;-9, --KILL                       Send KILL signal instead of TERM&lt;br /&gt;-n, --no-prompt                  Do not prompt user, kill all matching processes&lt;br /&gt;-h, --help                       Show this message&lt;br /&gt;&lt;/pre&gt;that does the job.   
It's not like I use it every day, but every time I use it, I'm glad it's there.&lt;br /&gt;&lt;br /&gt;Download the &lt;a href="http://gist.github.com/606895"&gt;script&lt;/a&gt; from GitHub (requires Ruby and a UNIX-based OS).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-1746179975376746655?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/1746179975376746655/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=1746179975376746655' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/1746179975376746655'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/1746179975376746655'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2010/10/killing-java-processes.html' title='Killing java processes'/><author><name>Alex Boisvert</name><uri>http://www.blogger.com/profile/05164682765137205886</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-8598079096712394150</id><published>2010-10-01T09:37:00.001-07:00</published><updated>2010-10-01T09:37:34.977-07:00</updated><title type='text'>modern IDEs influencing coding style?</title><content type='html'>&lt;blockquote&gt;&lt;br /&gt;It would be nice if globals, locals, and members could be syntax colored differently. 
That would be better than g_ and m_ prefixes.&lt;br /&gt;&lt;p&gt;- &lt;a href="http://twitter.com/ID_AA_Carmack/status/25121804310"&gt;John Carmack&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;I saw this from John Carmack last week and thought, what a great idea!  It seems very natural and easy to do and makes a lot more sense than crazy prefix conventions.  I've been mostly programming in Java, so the conventions are a little different, but I'd love it if we could get rid of using redundant "this" qualifiers to signal member variables, and the super ugly ALL_CAPS for constants... it just seems so outdated.&lt;br /&gt;&lt;br /&gt;Eclipse actually provides this kind of highlighting already:&lt;br /&gt;&lt;br /&gt;&lt;img src="http://com-bizo-public.s3.amazonaws.com/blog/eclipse/code/eclipse-code-1.png" /&gt;&lt;br /&gt;&lt;br /&gt;Notice that the member variable "greeting" is always in &lt;span style="color: #004BC4"&gt;blue&lt;/span&gt;, while the non-member variables are never highlighted.  Also, the public static constant "DEFAULT_GREETING" is &lt;span style="color: #004BC4"&gt;&lt;i&gt;blue and italicized&lt;/i&gt;&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;Notice that if you rename DEFAULT_GREETING, it's still &lt;em&gt;completely recognizable&lt;/em&gt; as a constant:&lt;br /&gt;&lt;br /&gt;&lt;img src="http://com-bizo-public.s3.amazonaws.com/blog/eclipse/code/eclipse-code-2.png" /&gt;&lt;br /&gt;&lt;br /&gt;I think it's interesting that modern IDEs are able to give us so much more information about the structure of our programs.  Stuff you used to have to explicitly call out via conventions like these.  How long until we're ready to make the leap and change our code conventions to keep up with our tools?&lt;br /&gt;&lt;br /&gt;The main argument against relying on tools to provide this kind of information is that not &lt;i&gt;all&lt;/i&gt; tools have caught up.  I'm not sure I completely buy this.  
Hopefully you're not actually remotely editing production code in vi or something.  There are a lot of web apps for viewing commits and performing code reviews, and they're unlikely to be as fully featured as your favorite IDE.  Still, the context is often limited enough to avoid confusion, and the majority of our time is spent in our IDEs anyway.&lt;br /&gt;&lt;br /&gt;So, can we drop the ALL_CAPS already?&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-8598079096712394150?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/8598079096712394150/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=8598079096712394150' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/8598079096712394150'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/8598079096712394150'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2010/10/modern-ides-influencing-coding-style.html' title='modern IDEs influencing coding style?'/><author><name>larry ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-492933084036886394</id><published>2010-09-29T16:12:00.001-07:00</published><updated>2010-09-29T16:17:15.501-07:00</updated><title type='text'>emr: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory</title><content type='html'>Moving one of 
our jobs from hive 0.4 / hadoop 0.18 to hive 0.5 / hadoop 0.20 on amazon emr, I ran into a weird error in the reduce stage, something like:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;java.io.IOException: Task: attempt_201007141555_0001_r_000009_0 - The reduce copier failed &lt;br /&gt;at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:384) &lt;br /&gt;at org.apache.hadoop.mapred.Child.main(Child.java:170) &lt;br /&gt;Caused by: java.io.IOException: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory &lt;br /&gt;at java.lang.ProcessBuilder.start(ProcessBuilder.java:459) &lt;br /&gt;at org.apache.hadoop.util.Shell.runCommand(Shell.java:149) &lt;br /&gt;at org.apache.hadoop.util.Shell.run(Shell.java:134) &lt;br /&gt;at org.apache.hadoop.fs.DF.getAvailable(DF.java:73) &lt;br /&gt;at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:329) &lt;br /&gt;at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124) &lt;br /&gt;at org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputFile.java:160) &lt;br /&gt;at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2622) &lt;br /&gt;at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2586) &lt;br /&gt;Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory &lt;br /&gt;at java.lang.UNIXProcess.&amp;lt;init&amp;gt;(UNIXProcess.java:148) &lt;br /&gt;at java.lang.ProcessImpl.start(ProcessImpl.java:65) &lt;br /&gt;at java.lang.ProcessBuilder.start(ProcessBuilder.java:452) &lt;br /&gt;... 
8 more &lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;There's some discussion on &lt;a href="http://developer.amazonwebservices.com/connect/thread.jspa?messageID=186499&amp;#186499"&gt;this thread in the emr forums&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;From &lt;a href="http://developer.amazonwebservices.com/connect/profile.jspa?userID=79478"&gt;Andrew's&lt;/a&gt; response to the thread:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;  The issue here is that when Java tries to fork a process (in this case bash), Linux allocates as much memory as the current Java process, even though the command you are running might use very little memory. When you have a large process on a machine that is low on memory this fork can fail because it is unable to allocate that memory. &lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;The workaround here is to either use an instance with more memory (m2 class), or reduce the number of mappers or reducers you are running on each machine to free up some memory.&lt;br /&gt;&lt;br /&gt;Since the task I was running was reduce heavy, I chose to just drop the number of mappers from 4 to 2.  
You can do this pretty easily with the &lt;a href="http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/index.html?PredefinedBootstrapActions.ConfigureHadoop.html"&gt;emr bootstrap actions&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;My job ended up looking something like this:&lt;br /&gt;&lt;br /&gt; elastic-mapreduce --create --name "awesome script" \&lt;br /&gt; --num-instances 8 --instance-type m1.large \&lt;br /&gt; --hadoop-version 0.20 \&lt;div style="color: #970a2d"&gt;--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \&lt;br /&gt; --args "-s,mapred.tasktracker.map.tasks.maximum=2" \&lt;/div&gt; --hive-script --arg  s3://....../script &lt;br /&gt;&lt;p&gt;(relevant parts highlighted).&lt;/p&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-492933084036886394?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/492933084036886394/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=492933084036886394' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/492933084036886394'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/492933084036886394'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2010/09/emr-cannot-run-program.html' title='emr: Cannot run program &amp;quot;bash&amp;quot;: java.io.IOException: error=12, Cannot allocate memory'/><author><name>larry ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-1808173114239880271</id><published>2010-09-21T05:02:00.000-07:00</published><updated>2010-09-22T15:20:36.863-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='salesforce dart'/><title type='text'>Salesforce and DART Synchronization</title><content type='html'>&lt;div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; background-color: transparent; "&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;I’ve recently started some work that involves extending Salesforce for our Ad Ops team.  For our most recent Hack Day, I decided to do a little project to continue learning about development with the Salesforce cloud platform, Force.com.&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; background-color: transparent; "&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; background-color: transparent; "&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;After thinking about what I wanted to work on, I decided to build a custom button that would allow a user to update an Account record in Salesforce with an Advertiser ID from DART, our primary ad serving platform, for 
the following reasons:&lt;/span&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="font-size: 15px; white-space: pre-wrap;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;ol style="font-family: Times; font-size: medium; "&gt;&lt;li style="list-style-type: decimal; font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; "&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;It’s a tool that I could see being used in our live Salesforce instance.&lt;/span&gt;&lt;/li&gt;&lt;li style="list-style-type: decimal; font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; "&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;It seems like a typical use case for extending Salesforce (i.e. 
integrating with a 3rd party SOAP service).&lt;/span&gt;&lt;/li&gt;&lt;/ol&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;The back of the napkin design looked like this:&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; background-color: transparent; font-family: Times; font-size: medium; "&gt;&lt;img src="http://3.bp.blogspot.com/_AQ2q1Jl3xrU/TJifw8ApaiI/AAAAAAAAABc/go60HqKG42Q/s400/salesforce_hackday.png" /&gt;&lt;br /&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;At a high-level, I wanted to call DART’s DFP API from within Salesforce and then update an Account object in Salesforce with the Advertiser Id returned from DART.  
However, I first needed to authenticate with Google’s ClientLogin service in order to get an authentication token for calling the DFP API.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: bold; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: bold; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;APEX&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;APEX is the programming language that allows a developer to customize a Salesforce installation.  APEX’s syntax, not surprisingly, is very similar to Java.  The really interesting thing is that none of the code you write actually compiles or runs on your machine.  
All compilation and execution happen “in the cloud”.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: bold; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;DART Integration&lt;/span&gt;&lt;span style="font-size: 24pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: bold; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;Salesforce has a strict security model.  In order to make a request to a Web Service you actually need to configure any URLs you are accessing as a Remote Site.  
Instructions for doing this can be found &lt;/span&gt;&lt;a href="http://wiki.developerforce.com/index.php/Apex_Web_Services_and_Callouts"&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 153); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap; "&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;.  For this project, I simply needed to add https://www.google.com as a Remote Site.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;There are a couple of options for calling a Web Service via APEX:&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;ul&gt;&lt;li style="list-style-type: disc; font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; "&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;Use the 
&lt;/span&gt;&lt;a href="http://www.salesforce.com/us/developer/docs/apexcode/index_Left.htm#StartTopic=Content/apex_qs_HelloWorld.htm"&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 153); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap; "&gt;Http/HttpRequest/HttpResponse&lt;/span&gt;&lt;/a&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt; APEX classes.  These are useful for calling REST-style services. &lt;/span&gt;&lt;/li&gt;&lt;li style="list-style-type: disc; font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; "&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;Import a WSDL and use the generated code to make a SOAP request.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;In this project, I ended up using both methods.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: 
baseline; white-space: pre-wrap; "&gt;Here is the APEX code I developed for calling Google’s ClientLogin authentication service:&lt;/span&gt;&lt;br /&gt;&lt;pre&gt;public class GoogleAuthIntegration {&lt;br /&gt;  private static String CLIENT_AUTH_URL = 'https://www.google.com/accounts/ClientLogin';&lt;br /&gt;&lt;br /&gt;  // login to google with the given email and password&lt;br /&gt;  public static String performClientLogin(final String email, final String password) {&lt;br /&gt;    final Http http = new Http();&lt;br /&gt;    final HttpRequest request = new HttpRequest();&lt;br /&gt;    request.setEndpoint(CLIENT_AUTH_URL);&lt;br /&gt;    request.setMethod('POST');&lt;br /&gt;    request.setHeader('Content-type', 'application/x-www-form-urlencoded');&lt;br /&gt; &lt;br /&gt;    // URL-encode the credentials so that reserved characters do not corrupt the form body&lt;br /&gt;    final String body = 'service=gam&amp;amp;accountType=GOOGLE&amp;amp;' + 'Email=' + EncodingUtil.urlEncode(email, 'UTF-8') + '&amp;amp;Passwd=' + EncodingUtil.urlEncode(password, 'UTF-8');&lt;br /&gt;    request.setBody(body);&lt;br /&gt; &lt;br /&gt;    final HttpResponse response = http.send(request);&lt;br /&gt;    final String responseBody = response.getBody();&lt;br /&gt;    // the token follows 'Auth=' in the response body&lt;br /&gt;    final String authToken = responseBody.substring(responseBody.indexOf('Auth=') + 5).trim();&lt;br /&gt; &lt;br /&gt;    System.debug('authToken is: ' + authToken);&lt;br /&gt;    return authToken;&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; background-color: transparent; "&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;This code fetches an authToken for the given username and password.  Once I had the authToken, I could then call the DFP API.  
For this part, I used WSDL/SOAP, the 2nd method for calling web services.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;Salesforce provides a way to import a WSDL file via its Admin UI.  It then parses and generates APEX code that allows you to call methods exposed by the WSDL.  However, when I tried importing DFP’s &lt;/span&gt;&lt;a href="https://www.google.com/apis/ads/publisher/v201004/CompanyService?wsdl" style="font-family: Times; font-size: medium; "&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 153); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap; "&gt;Company Service WSDL&lt;/span&gt;&lt;/a&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;, I ran into some errors:&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; background-color: transparent; "&gt;&lt;span style="color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; "&gt;&lt;/span&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="font-size: 15px; white-space: 
pre-wrap;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;img src="https://lh5.googleusercontent.com/FmukHikPYaEmX9GivQ5Tmc5Ot6PjyRAvD5UXTz-GZ6MlCIUgJ3vh39BvJk050-sJ_C_cCt15zRmUHNZ9AsXBJ6pMG5FjtZGfNXBiQurySbFr18T5lw" width="673px;" height="231px;" /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;It turns out that the WSDL contains an element named ‘trigger’ and trigger is a reserved APEX keyword.  In any event, I ended up copy/pasting the generated code and fixing it so that it compiled correctly (I also ran into a problem where generated exception classes were not extending Exception).  
&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;Once the code to call the DFP Company Service was compiling, I created an APEX controller to perform the update on an Account record.&lt;/span&gt;&lt;pre&gt;public class SyncDartAccountController {&lt;br /&gt;  private final Account acct;&lt;br /&gt; &lt;br /&gt;  public SyncDartAccountController(ApexPages.StandardController stdController) {&lt;br /&gt;    this.acct = (Account) stdController.getRecord();&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  // Code we will invoke on page load.&lt;br /&gt;  public PageReference onLoad() {&lt;br /&gt;    String theId = ApexPages.currentPage().getParameters().get('id');&lt;br /&gt;&lt;br /&gt;    if (theId == null) {&lt;br /&gt;      // Display the Visualforce page's content if no Id is passed over&lt;br /&gt;      return null;&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    // get authToken for DFP API requests&lt;br /&gt;    String authToken = GoogleAuthIntegration.performClientLogin('xxx@xxx.com', 'xxxx');&lt;br /&gt;&lt;br /&gt;    // get Account with the given id&lt;br /&gt;    for (Account o:[select id, name from Account where id =:theId]) {&lt;br /&gt;      DartCompanyService.CompanyServiceInterfacePort p = new DartCompanyService.CompanyServiceInterfacePort();&lt;br /&gt;      p.RequestHeader = new DartCompanyService.SoapRequestHeader();&lt;br /&gt;      p.RequestHeader.applicationName = 'sampleapp';&lt;br /&gt;   
&lt;br /&gt;      // prepare the DFP query and execute&lt;br /&gt;      DartCompanyService.Statement filterByNameAndType = new DartCompanyService.Statement();&lt;br /&gt;      filterByNameAndType.query = 'WHERE name = \'' + o.Name + '\' and type = \'ADVERTISER\'';&lt;br /&gt;  &lt;br /&gt;      DartCompanyService.CompanyPage page = p.getCompaniesByStatement(filterByNameAndType);&lt;br /&gt;   &lt;br /&gt;      if (page.totalResultSetSize &gt; 0) {&lt;br /&gt;        // update the record if we get a result&lt;br /&gt;        o.Dart_Advertiser_Id__c = page.results.get(0).id;&lt;br /&gt;        update o;&lt;br /&gt;       }&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    // Redirect the user back to the original page&lt;br /&gt;    PageReference pageRef = new PageReference('/' + theId);&lt;br /&gt;    pageRef.setRedirect(true);&lt;br /&gt;    return pageRef;&lt;br /&gt;  }&lt;br /&gt;}&lt;/pre&gt;&lt;/div&gt;&lt;div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; background-color: transparent; "&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: bold; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;UI updates&lt;/span&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; 
"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;Then, I created a simple Visuaforce page to invoke the controller:&lt;/span&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;&amp;lt;apex:page standardController=&amp;quot;Account&amp;quot;  extensions=&amp;quot;SyncDartAccountController&amp;quot; action=&amp;quot;{!onLoad}&amp;quot;&amp;gt;&lt;br /&gt;  &amp;lt;apex:sectionHeader title=&amp;quot;Auto-Running Apex Code&amp;quot;/&amp;gt;&lt;br /&gt;  &amp;lt;apex:outputPanel &amp;gt;&lt;br /&gt;   You tried calling Apex Code from a button.  If you see this page, something went wrong.  &lt;br /&gt;        You should have been redirected back to the record you clicked the button from.&lt;br /&gt;  &amp;lt;/apex:outputPanel&amp;gt;&lt;br /&gt;&amp;lt;/apex:page&amp;gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; background-color: transparent; "&gt;&lt;span style="color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; "&gt;&lt;/span&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="font-size: 15px; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;Finally, I added a custom button to the Account page which would invoke the Visualforce page.  
You can do this in the Salesforce UI:&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; background-color: transparent; "&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;1) Click on ‘Buttons and Links’:&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; background-color: transparent; "&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; background-color: transparent; "&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;img src="https://lh3.googleusercontent.com/74Gjg-sRHcAGcLN1HWEVCGsDkC-v1vvqin-J6aa_O_UyhPU8gGJTQRjqlQnaXceMeBruA_Ak93k0XR2mBQ3vqniDNH4-c5-iHtMWA3mnmT15eV5asA" width="220px;" height="318px;" /&gt;&lt;span style="font-size: 24pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: bold; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" 
style="font-size: medium;"&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;2) Click New:&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; background-color: transparent; "&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;img src="https://lh6.googleusercontent.com/vkj4yqkyjEMF8pfrYu7XpJZrMsZaOWtMP5WuzMosWYZ8urZn0tdxnc1iPTwwTkirPeVsr7_4EyPKA5ktLVC-6NSaq3iTc4AksCAvkHL8JaUmbYxZLA" width="358px;" height="43px;" /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;3) Enter the info for the new button:&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; 
margin-left: 0px; background-color: transparent; "&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; background-color: transparent; "&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;img src="https://lh4.googleusercontent.com/qUKrgE3VPnLx8XTQqzL75GgrCH835CDBGhN5edN1KtYRNoy9wDWUoGcs_ar6J5RP4F2RX-vwERb_kxos5mOum0nxpKAzcHkMgHnMA-00BuCZdpnllA" width="672px;" height="369px;" /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;4) After clicking on Save, we can add the button to the Account page layout.  
The final result:&lt;/span&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;img src="https://lh3.googleusercontent.com/e1mIr4kh-PuxNR0CnySZ5HVeuhgZewb5v5c94PsUb9x9zlJd6EvCvI9qjgtWcuSg07IT558AwiXu6M4LrLdDVbc1VbZj9oTJNzrpGCLnwR1y0PSv0Q" width="615px;" height="230px;" /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 24pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: bold; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;Final Thoughts&lt;/span&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 11pt; font-family: Arial; color: rgb(0, 0, 0); background-color: transparent; font-weight: normal; font-style: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap; "&gt;This was my first foray into APEX programming in Salesforce and I was pleased with the overall set of tools and ability to be productive quickly.  The only hiccup I encountered was in the WSDL generation step and this issue was fairly easy to overcome.  
There are good developer docs and there are ways to add debug logging (which I didn’t go over) as well as a framework for unit testing.&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-1808173114239880271?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/1808173114239880271/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=1808173114239880271' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/1808173114239880271'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/1808173114239880271'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2010/09/salesforce-and-dart-synchronization.html' title='Salesforce and DART Synchronization'/><author><name>Timo</name><uri>http://www.blogger.com/profile/05949421779840031276</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_AQ2q1Jl3xrU/TJifw8ApaiI/AAAAAAAAABc/go60HqKG42Q/s72-c/salesforce_hackday.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-2766033128148468303</id><published>2010-09-20T13:59:00.001-07:00</published><updated>2010-09-20T13:59:03.808-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hive'/><category scheme='http://www.blogger.com/atom/ns#' term='emr'/><title type='text'>quick script: emr-mailer</title><content type='html'>We write a lot 
of &lt;a href="http://dev.bizo.com/search/label/hive"&gt;hive&lt;/a&gt; reports.  Frequently we want to email the resulting report to a list.  In the past I've usually done this with some one-off post processing scripts, but I thought it would be nice to write a reusable &lt;a href="http://aws.amazon.com/elasticmapreduce/"&gt;emr&lt;/a&gt; job step that will execute as part of the hive job.&lt;br /&gt;&lt;br /&gt;The script will download files from an s3 url, concatenate them together, zip up the results and send it as an attachment to a specified email address.  It sends email through smtp.mail.com, using account credentials you specify.&lt;br /&gt;&lt;br /&gt;I wanted to make it easy to just append an additional step to any existing job, not requiring any additional machine setup or dependencies.  I was able to do this by making use of amazon's script-runner (s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar).  The script-runner.jar step will let you execute an arbitrary script from a location in s3 as an emr job step.&lt;br /&gt;&lt;br /&gt;As I mentioned, the intended usage is to run it as a job step with your hive script, passing in the location of the resulting report.&lt;br /&gt;&lt;br /&gt;E.g.:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;elastic-mapreduce --create --name "my awesome report ${MONTH}" \&lt;br /&gt;   --num-instances 10 --instance-type c1.medium  --hadoop-version 0.20 \&lt;br /&gt;   --hive-script --arg s3://path/to/hive/script.sql \&lt;br /&gt;   --args -d,MONTH=${MONTH} --args -d,START=${START} --args -d,END=${END} \&lt;br /&gt;   --jar s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar \&lt;br /&gt;   --args s3://path/to/emr-mailer/send-report.rb \&lt;br /&gt;   --args -n,report_${MONTH} --args -s,"my awesome report ${MONTH}" \&lt;br /&gt;   --args -e,awesome-reports@company.com \&lt;br /&gt;   --args -r,s3://path/to/report/results&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Above you can see I'm 
starting a hive report as normal, then simply appending the script-runner step, calling the emr-mailer send-report.rb, telling it where the report will end up, and details about the email.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The full source code is available on github as &lt;a href="http://github.com/ogrodnek/emr-mailer"&gt;emr-mailer&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The script is pretty simple, but let me know if you have any suggestions for improvements or other feedback.&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-2766033128148468303?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/2766033128148468303/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=2766033128148468303' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/2766033128148468303'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/2766033128148468303'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2010/09/quick-script-emr-mailer.html' title='quick script: emr-mailer'/><author><name>larry ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-4245737772295205599</id><published>2010-08-13T20:18:00.000-07:00</published><updated>2010-08-13T20:26:31.458-07:00</updated><title type='text'>Collecting User Actions with GWT</title><content type='html'>&lt;p&gt;While I was at 
one of the &lt;a href='http://code.google.com/events/io/2010/'&gt;Google I/O&lt;/a&gt; GWT sessions (courtesy of &lt;a href='http://www.bizo.com'&gt;Bizo&lt;/a&gt;), a Google presenter mentioned how one of their internal GWT applications tracks user actions.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;The idea is really just a souped-up, AJAX version of server-side access logs: capturing, buffering, and sending fine-grained user actions up to the server for later analysis.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;The Google team was using this data to make A/B-testing-style decisions about features&amp;#8211;which ones were being used, &lt;em&gt;not&lt;/em&gt; being used, tripping users up, etc.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;I thought the idea was pretty nifty, so I fleshed out an initial implementation in &lt;a href='http://bizads.bizo.com'&gt;BizAds&lt;/a&gt; for &lt;a href='http://www.bizo.com'&gt;Bizo&amp;#8217;s&lt;/a&gt; recent hack day. And now I am documenting my wild success for &lt;a href='http://www.bizo.com'&gt;Bizo&amp;#8217;s&lt;/a&gt; first post-hack-day &amp;#8220;beer &amp;amp; blogs&amp;#8221; day.&lt;/p&gt;&lt;br /&gt;&lt;h2 id='no_access_logs'&gt;No Access Logs&lt;/h2&gt;&lt;br /&gt;&lt;p&gt;Traditional, page-based web sites typically use access logs for site analytics. For example, the user was on &lt;code&gt;a.html&lt;/code&gt;, then &lt;code&gt;b.html&lt;/code&gt;. 
Services like &lt;a href='http://www.google.com/analytics'&gt;Google Analytics&lt;/a&gt; can then slice and dice your logs to tell you interesting things.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;However, desktop-style one-page webapps don&amp;#8217;t generate these access logs&amp;#8211;the user is always on the first page&amp;#8211;so they must rely on something else.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;This is pretty normal for AJAX apps, and Google Analytics already supports it via its asynchronous API.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;We had already been doing this from GWT with code like:&lt;/p&gt;&lt;br /&gt;&lt;pre class='brush:java'&gt;&lt;code class=prettyprint&gt;public native void trackInGA(final String pageName) /*-{&lt;br /&gt;  $wnd._gaq.push([&amp;#39;_trackPageview&amp;#39;, pageName]);&lt;br /&gt;}-*/;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;p&gt;And since we&amp;#8217;re using an MVP/places-style architecture (see &lt;a href='http://www.gwtmpv.org'&gt;gwt-mpv&lt;/a&gt;), we just call this on each place change. Done.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Google Analytics is back in action, not a big deal.&lt;/p&gt;&lt;br /&gt;&lt;h2 id='beyond_access_logs'&gt;Beyond Access Logs&lt;/h2&gt;&lt;br /&gt;&lt;p&gt;What was novel, to me, about this internal Google application&amp;#8217;s approach was how the tracked user actions were much more fine-grained than just &amp;#8220;page&amp;#8221; level.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;For example, which buttons the user hovers over. Which ones they click (even if it doesn&amp;#8217;t lead to a page load). What client-side validation messages are tripping them up. 
Any number of small &amp;#8220;intra-page&amp;#8221; things that are nonetheless useful to know.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Obviously there are a few challenges, mostly around not wanting to detract from the user experience:&lt;/p&gt;&lt;br /&gt;&lt;ul&gt; &lt;br /&gt;&lt;li&gt;How much data is too much?&lt;br /&gt;&lt;p&gt;Tracking the mouse over of every element would be excessive. But the mouse over of key elements? Should be okay.&lt;/p&gt; &lt;br /&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt; How often to send the data?&lt;br /&gt;&lt;p&gt;If you wait too long while buffering user actions before uploading them to the server, the user may leave the page and you&amp;#8217;ll lose them. (Unless you use a page unload hook, and the browser hasn&amp;#8217;t crashed.)&lt;/p&gt;&lt;br /&gt;&lt;p&gt;If you send data too often, the user might get annoyed.&lt;/p&gt; &lt;br /&gt;&lt;/li&gt; &lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;p&gt;The key to doing this right is having metrics in place to know whether you&amp;#8217;re prohibitively affecting the user experience.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;The internal Google team had these metrics for their application, and that allowed them to start out batch uploading actions every 30 seconds, then every 20 seconds, and finally every 3 seconds. 
Each time they could tell the users&amp;#8217; experience was not adversely affected.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Unfortunately, I don&amp;#8217;t know what exactly this metric was (I should have asked), but I imagine it&amp;#8217;s fairly application-specific&amp;#8211;think of Gmail and average emails read/minute or something like that.&lt;/p&gt;&lt;br /&gt;&lt;h2 id='implementation'&gt;Implementation&lt;/h2&gt;&lt;br /&gt;&lt;p&gt;I was able to implement this concept rather easily, mostly by reusing existing infrastructure our GWT application already had.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;When interesting actions occur, I have the presenters fire a generic &lt;code&gt;UserActionEvent&lt;/code&gt;, which is generated using &lt;a href='http://github.com/stephenh/gwt-mpv-apt'&gt;gwt-mpv-apt&lt;/a&gt; from this spec:&lt;/p&gt;&lt;br /&gt;&lt;pre class='brush:java'&gt;&lt;code class=prettyprint&gt;@GenEvent&lt;br /&gt;public class UserActionEventSpec {&lt;br /&gt;  @Param(1)&lt;br /&gt;  String name;&lt;br /&gt;  @Param(2)&lt;br /&gt;  String value;&lt;br /&gt;  @Param(3)&lt;br /&gt;  boolean flushNow;&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;p&gt;Initiating the tracking of an action is now just as simple as firing an event:&lt;/p&gt;&lt;br /&gt;&lt;pre class='brush:java'&gt;&lt;code class=prettyprint&gt;UserActionEvent.fire(&lt;br /&gt;   eventBus,&lt;br /&gt;   &amp;quot;someAction&amp;quot;,&lt;br /&gt;   &amp;quot;someValue&amp;quot;,&lt;br /&gt;   false);&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;p&gt;I have a separate, decoupled &lt;code&gt;UserActionUploader&lt;/code&gt;, which is listening for these events and buffers them into a client-side list of &lt;code&gt;UserAction&lt;/code&gt; DTOs:&lt;/p&gt;&lt;br /&gt;&lt;pre class='brush:java'&gt;&lt;code class=prettyprint&gt;private class OnUserAction implements UserActionHandler {&lt;br /&gt;  public void onUserAction(final UserActionEvent event) {&lt;br /&gt;    UserAction action = new UserAction();&lt;br 
/&gt;    action.user = defaultString(getEmailAddress(), &amp;quot;unknown&amp;quot;);&lt;br /&gt;    action.name = event.getName();&lt;br /&gt;    action.value = event.getValue();&lt;br /&gt;    actions.add(action);&lt;br /&gt;    if (event.getFlushNow()) {&lt;br /&gt;      flush();&lt;br /&gt;    }&lt;br /&gt;  }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;p&gt;&lt;code&gt;UserActionUploader&lt;/code&gt; sets a timer that calls &lt;code&gt;flush&lt;/code&gt; every 3 seconds:&lt;/p&gt;&lt;br /&gt;&lt;pre class='brush:java'&gt;&lt;code class=prettyprint&gt;private void flush() {&lt;br /&gt;  if (actions.size() == 0) {&lt;br /&gt;    return;&lt;br /&gt;  }&lt;br /&gt;  ArrayList&amp;lt;UserAction&amp;gt; copy =&lt;br /&gt;    new ArrayList&amp;lt;UserAction&amp;gt;(actions);&lt;br /&gt;  actions.clear();&lt;br /&gt;  async.execute(&lt;br /&gt;    new SaveUserActionAction(copy),&lt;br /&gt;    new OnSaveUserActionResult());&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;p&gt;The &lt;code&gt;flush&lt;/code&gt; method uses &lt;a href='http://code.google.com/p/gwt-dispatch/'&gt;gwt-dispatch&lt;/a&gt;-style action/result classes, also generated by &lt;a href='http://github.com/stephenh/gwt-mpv-apt'&gt;gwt-mpv-apt&lt;/a&gt;, to send the buffered actions to the server via GWT-RPC:&lt;/p&gt;&lt;br /&gt;&lt;pre class='brush:java'&gt;&lt;code class=prettyprint&gt;@GenDispatch&lt;br /&gt;public class SaveUserActionSpec {&lt;br /&gt;  @In(1)&lt;br /&gt;  ArrayList&amp;lt;UserAction&amp;gt; actions;&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;p&gt;This results in &lt;code&gt;SaveUserActionAction&lt;/code&gt; (okay, bad name) and &lt;code&gt;SaveUserActionResult&lt;/code&gt; DTOs getting generated, with nice constructors, getters, setters, etc.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;On the server side, I was able to reuse an excellent &lt;code&gt;DatalogManager&lt;/code&gt; class from one of my &lt;a href='http://www.bizo.com'&gt;Bizo&lt;/a&gt; colleagues (unfortunately not open source (yet?)) 
that buffers the actions data on the server&amp;#8217;s hard disk and then periodically uploads the files to Amazon&amp;#8217;s S3.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Once the data is in S3, it&amp;#8217;s pretty routine to set up a Hive job to read it, do any fancy reporting (grouping/etc.), and drop it into a CSV file. For now I&amp;#8217;m just listing raw actions:&lt;/p&gt;&lt;br /&gt;&lt;pre class='brush:sql'&gt;&lt;code class=prettyprint&gt;-- Pick up the DatalogManager files in S3&lt;br /&gt;drop table dlm_actions;&lt;br /&gt;create external table dlm_actions (&lt;br /&gt;    d map&amp;lt;string, string&amp;gt;&lt;br /&gt;)&lt;br /&gt;partitioned by (dt string comment &amp;#39;yyyyddmmhh&amp;#39;)&lt;br /&gt;row format delimited&lt;br /&gt;fields terminated by &amp;#39;\n&amp;#39; collection items terminated by &amp;#39;\001&amp;#39; map keys terminated by &amp;#39;\002&amp;#39;&lt;br /&gt;location &amp;#39;s3://&amp;lt;actions-dlm-bucket&amp;gt;/&amp;lt;folder&amp;gt;/&amp;#39;&lt;br /&gt;;&lt;br /&gt; &lt;br /&gt;alter table dlm_actions recover partitions;&lt;br /&gt; &lt;br /&gt;-- Make a csv destination also in S3&lt;br /&gt;create external table csv_actions (&lt;br /&gt;    user string,&lt;br /&gt;    action string,&lt;br /&gt;    value string&lt;br /&gt;)&lt;br /&gt;row format delimited fields terminated by &amp;#39;,&amp;#39;&lt;br /&gt;location &amp;#39;s3://&amp;lt;actions-report-bucket&amp;gt;/${START}-${END}/parts&amp;#39;&lt;br /&gt;;&lt;br /&gt; &lt;br /&gt;-- Move the data over (nothing intelligent yet)&lt;br /&gt;insert overwrite table csv_actions&lt;br /&gt;select dlm.d[&amp;quot;USER&amp;quot;], dlm.d[&amp;quot;ACTION&amp;quot;], dlm.d[&amp;quot;VALUE&amp;quot;]&lt;br /&gt;from dlm_actions dlm&lt;br /&gt;where&lt;br /&gt;    dlm.dt &amp;gt;= &amp;#39;${START}00&amp;#39; and dlm.dt &amp;lt; &amp;#39;${END}00&amp;#39;&lt;br /&gt;;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;p&gt;Then we use Hudson as a cron-with-a-GUI to run this Hive script as an Amazon 
Elastic MapReduce job once per day.&lt;/p&gt;&lt;br /&gt;&lt;h2 id='testing'&gt;Testing&lt;/h2&gt;&lt;br /&gt;&lt;p&gt;Thanks to the awesomeness of &lt;a href='http://www.gwtmpv.org'&gt;gwt-mpv&lt;/a&gt;, the usual GWT widgets, GWT-RPC, etc., can be doubled out and tested with pure-Java unit tests.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;For example, a method from &lt;code&gt;UserActionUploaderTest&lt;/code&gt;:&lt;/p&gt;&lt;br /&gt;&lt;pre class='brush:java'&gt;&lt;code class=prettyprint&gt;UserActionUploader uploader = new UserActionUploader(registry);&lt;br /&gt;StubTimer timer = (StubTimer) uploader.getTimer();&lt;br /&gt; &lt;br /&gt;@Test&lt;br /&gt;public void uploadIsBuffered() {&lt;br /&gt;  eventBus.fireEvent(new UserActionEvent(&amp;quot;someaction&amp;quot;, &amp;quot;value1&amp;quot;, false));&lt;br /&gt;  eventBus.fireEvent(new UserActionEvent(&amp;quot;someaction&amp;quot;, &amp;quot;value2&amp;quot;, false));&lt;br /&gt;  assertThat(async.getOutstanding().size(), is(0)); // buffered&lt;br /&gt; &lt;br /&gt;  timer.run();&lt;br /&gt;  final SaveUserActionAction a1 = async.getAction(SaveUserActionAction.class);&lt;br /&gt;  assertThat(a1.getActions().size(), is(2));&lt;br /&gt;  assertAction(a1, 0, &amp;quot;anonymous&amp;quot;, &amp;quot;someaction&amp;quot;, &amp;quot;value1&amp;quot;);&lt;br /&gt;  assertAction(a1, 1, &amp;quot;anonymous&amp;quot;, &amp;quot;someaction&amp;quot;, &amp;quot;value2&amp;quot;);&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;p&gt;The usual GWT timers are stubbed out by a &lt;code&gt;StubTimer&lt;/code&gt;, which we can manually tick via &lt;code&gt;timer.run()&lt;/code&gt; to deterministically test timer-delayed business logic.&lt;/p&gt;&lt;br /&gt;&lt;h2 id='thats_it'&gt;That&amp;#8217;s It&lt;/h2&gt;&lt;br /&gt;&lt;p&gt;I can&amp;#8217;t say we have made any feature-altering decisions for &lt;a href='http://bizads.bizo.com'&gt;BizAds&lt;/a&gt; based on the data gathered from this approach yet&amp;#8211;technically it&amp;#8217;s 
not live yet. But it&amp;#8217;s so amazing that surely we will. Ask me about it sometime in the future.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-4245737772295205599?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/4245737772295205599/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=4245737772295205599' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/4245737772295205599'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/4245737772295205599'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2010/08/collecting-user-actions-with-gwt.html' title='Collecting User Actions with GWT'/><author><name>Stephen Haberman</name><uri>http://www.blogger.com/profile/05412274950722949930</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-4312715384172818904</id><published>2010-08-13T15:44:00.001-07:00</published><updated>2010-08-13T15:44:16.280-07:00</updated><title type='text'>hackday: analog meters</title><content type='html'>For this last hackday, I decided to work on something more hardware hacking related.  
At this year's &lt;a href="http://makerfaire.com/"&gt;Maker Faire&lt;/a&gt;, I was really inspired by all the cool stuff people were building, so I picked up an &lt;a href="http://www.arduino.cc/"&gt;arduino&lt;/a&gt; and started playing around with a couple of things.&lt;br /&gt;&lt;br /&gt;I've always wanted to have some cool old-school analog VU type meters displaying web requests.&lt;br /&gt;&lt;br /&gt;Here's my completed hackday project:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.flickr.com/photos/logrodnek/4886748674/"&gt;&lt;img src="http://farm5.static.flickr.com/4073/4886748674_69d378a309.jpg" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Here's a view of the components from the back:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.flickr.com/photos/logrodnek/4886751606/in/photostream/"&gt;&lt;img src="http://farm5.static.flickr.com/4096/4886751606_76ebe9ed1c.jpg" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;It's battery operated and receives data wirelessly over RF from another arduino I have hooked up via serial to my laptop.&lt;br /&gt;&lt;br /&gt;It's pretty simple, but I'm still totally psyched about how it came out.&lt;br /&gt;&lt;br /&gt;The main components are some &lt;a href="http://www.adafruit.com/index.php?main_page=product_info&amp;cPath=37&amp;products_id=252"&gt;analog panel meters&lt;/a&gt; (kinda pricey, but awesome), and an &lt;a href="http://www.sparkfun.com/commerce/product_info.php?products_id=8948"&gt;RF receiver&lt;/a&gt;.  
The frame is a piece of scrap acrylic from &lt;a href="http://www.tapplastics.com/"&gt;TAP Plastics&lt;/a&gt; that I drilled and cut to size, and the stand is a piece of a wire clothes hanger bent to shape.&lt;br /&gt;&lt;br /&gt;Connected to my computer is another arduino (actually a &lt;a href="http://www.appliedplatonics.com/volksduino/"&gt;volksduino&lt;/a&gt;) that receives updates over USB and sends the data out over RF:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.flickr.com/photos/logrodnek/4886752598/in/set-72157624712610496/"&gt;&lt;img src="http://farm5.static.flickr.com/4142/4886752598_2522c0839b_d.jpg"&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;You may be asking: why bother with wireless if you need a computer hooked up through serial anyway?  Or you may ask: why not just connect to a wireless network directly?&lt;br /&gt;&lt;br /&gt;Well, I wanted the meters to be able to be moved around, or mounted on a wall... I wanted them wireless.  But it turns out that wireless and even ethernet solutions for connecting an arduino to the internet directly are comparatively pretty expensive.  Even using bluetooth is expensive.  My long term plan is to have a single arduino connected to the internet directly (via ethernet or wireless), and have it serve as a proxy over RF for the others...  So this is a bit of work towards that.&lt;br /&gt;&lt;br /&gt;I wrote a bit of Java code to connect to amazon's cloudwatch to pull the load balancer statistics for two of our services.  I then discovered it's near impossible to connect to anything over USB in Java...  It is ridiculous.  Luckily, it's REALLY easy to do this with &lt;a href="http://processing.org/"&gt;Processing&lt;/a&gt;, so I wrote a simple processing program that used my cloudwatch library and wrote it out to serial.&lt;br /&gt;&lt;br /&gt;And that's really it.  The arduino reads data over serial, and periodically sends it over RF.  
The arduino hooked up to the meters simply reads the values over RF and sets the meters to display a scaled version of the results.  They're showing requests per second.  We get a huge number of requests per second with these services, so the numbers on the dial aren't actually correct (I need to make some custom faceplates).  It also flashes an LED every time it gets an RF transmission.&lt;br /&gt;&lt;br /&gt;Here's a quick video of it in action:&lt;br /&gt;&lt;br /&gt;&lt;object type="application/x-shockwave-flash" width="400" height="227" data="http://www.flickr.com/apps/video/stewart.swf?v=71377" classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000"&gt;&lt;param name="flashvars" value="intl_lang=en-us&amp;photo_secret=0efb547404&amp;photo_id=4889370510"&gt;&lt;/param&gt;&lt;param name="movie" value="http://www.flickr.com/apps/video/stewart.swf?v=71377"&gt;&lt;/param&gt;&lt;param name="bgcolor" value="#000000"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;embed type="application/x-shockwave-flash" src="http://www.flickr.com/apps/video/stewart.swf?v=71377" bgcolor="#000000" allowfullscreen="true" flashvars="intl_lang=en-us&amp;photo_secret=0efb547404&amp;photo_id=4889370510" height="227" width="400"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The one thing I'm not crazy about is that the finest resolution you can get from cloudwatch is per-minute stats, so the meters don't actually change as often as I would like.&lt;br /&gt;&lt;br /&gt;Still, pretty cool.  
I'm looking forward to building some more displays like this in the future.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-4312715384172818904?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/4312715384172818904/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=4312715384172818904' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/4312715384172818904'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/4312715384172818904'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2010/08/hackday-analog-meters.html' title='hackday: analog meters'/><author><name>larry ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm5.static.flickr.com/4073/4886748674_69d378a309_t.jpg' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-7887618721755558062</id><published>2010-07-29T16:52:00.000-07:00</published><updated>2010-09-20T12:31:29.219-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hive'/><category scheme='http://www.blogger.com/atom/ns#' term='udtf'/><title type='text'>Extending Hive with Custom UDTFs</title><content type='html'>&lt;p&gt;Let’s take a look at the canonical word count example in Hive: given a table of documents, create a table containing each word and the number of times 
it appears across all documents.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Here’s one implementation from &lt;a href="http://www.facebook.com/note.php?note_id=89508453919"&gt;the Facebook engineers&lt;/a&gt;:&lt;/p&gt;&lt;br /&gt;&lt;code class="prettyprint"&gt;&lt;br /&gt;CREATE TABLE docs(contents STRING);&lt;br /&gt;&lt;br /&gt;FROM (&lt;br /&gt;  MAP docs.contents &lt;br /&gt;  USING 'tokenizer_script' &lt;br /&gt;  AS &lt;br /&gt;    word, &lt;br /&gt;    cnt&lt;br /&gt;  FROM docs&lt;br /&gt;  CLUSTER BY word&lt;br /&gt;) map_output&lt;br /&gt;REDUCE map_output.word, map_output.cnt &lt;br /&gt;USING 'count_script' &lt;br /&gt;AS &lt;br /&gt;  word, &lt;br /&gt;  cnt&lt;br /&gt;;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;p&gt;In this example, the heavy lifting is being done by calling out to two scripts, ‘tokenizer_script’ and ‘count_script’, that provide custom mapper logic and reducer logic.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Hive 0.5 adds User Defined Table-Generating Functions (UDTFs), which offer another option for inserting custom mapper logic.  (Reducer logic can be plugged in via a User Defined Aggregation Function, the subject of a future post.)  From a user perspective, UDTFs are similar to User Defined Functions except they can produce an arbitrary number of output rows for each input row.  For example, the built-in UDTF “explode(array A)” converts a single row of input containing an array into multiple rows of output, each containing one of the elements of A.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;So, let’s implement a UDTF that does the same thing as the ‘tokenizer_script’ in the word count example.  Basically, we want to convert a document string into multiple rows with the format (word STRING, cnt INT), where the count will always be one.&lt;/p&gt;&lt;br /&gt;&lt;h3&gt;The Tokenizer UDTF&lt;/h3&gt;&lt;br /&gt;&lt;p&gt;To start, we extend the org.apache.hadoop.hive.ql.udf.generic.GenericUDTF class.  (There is no plain UDTF class.)  
We need to implement three methods: initialize, process, and close.  To emit output, we call the forward method.&lt;/p&gt;&lt;br /&gt;&lt;h3&gt;Adding a name and description:&lt;/h3&gt;&lt;br /&gt;&lt;code class="prettyprint"&gt;&lt;br /&gt;@description(name = "tokenize", value = "_FUNC_(doc) - emits (token, 1) for each token in the input document")&lt;br /&gt;public class TokenizerUDTF extends GenericUDTF {&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;p&gt;You can add a UDTF name and description using a @description annotation.  These will be available on the Hive console via the show functions and describe function tokenize commands.&lt;/p&gt;&lt;br /&gt;&lt;h3&gt;The initialize method:&lt;/h3&gt;&lt;br /&gt;&lt;code class="prettyprint"&gt;&lt;br /&gt;  public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;p&gt;This method will be called exactly once per instance.  In addition to performing any custom initialization logic you may need, it is responsible for verifying the input types and specifying the output types.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;Hive uses a system of ObjectInspectors to both describe types and to convert Objects into more specific types.  For our tokenizer, we want a single String as input, so we’ll check that the input ObjectInspector[] array contains a single PrimitiveObjectInspector of the STRING category.  
If anything is wrong, we throw a UDFArgumentException with a suitable error message.&lt;/p&gt;&lt;br /&gt;&lt;code class="prettyprint"&gt;&lt;br /&gt;    if (args.length != 1) {&lt;br /&gt;      throw new UDFArgumentException("tokenize() takes exactly one argument");&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE&lt;br /&gt;        || ((PrimitiveObjectInspector) args[0]).getPrimitiveCategory() != PrimitiveObjectInspector.PrimitiveCategory.STRING) {&lt;br /&gt;      throw new UDFArgumentException("tokenize() takes a string as a parameter");&lt;br /&gt;    }&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;p&gt;We can actually use this object inspector to convert inputs into Strings in our process method.  This is less important for primitive types, but it can be handy for more complex objects.  So, assuming stringOI is an instance variable,&lt;/p&gt;&lt;br /&gt;&lt;code class="prettyprint"&gt;&lt;br /&gt;    stringOI = (PrimitiveObjectInspector) args[0];&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;p&gt;Similarly, we want our process method to return an Object[] array containing a String and an Integer, so we’ll return a StandardStructObjectInspector containing a JavaStringObjectInspector and a JavaIntObjectInspector.  
We’ll also supply names for these output columns, but they’re not really relevant at runtime since the user will supply his or her own aliases.&lt;/p&gt;&lt;br /&gt;&lt;code class="prettyprint"&gt;&lt;br /&gt;    List&amp;lt;String&amp;gt; fieldNames = new ArrayList&amp;lt;String&amp;gt;(2);&lt;br /&gt;    List&amp;lt;ObjectInspector&amp;gt; fieldOIs = new ArrayList&amp;lt;ObjectInspector&amp;gt;(2);&lt;br /&gt;    fieldNames.add("word");&lt;br /&gt;    fieldNames.add("cnt");&lt;br /&gt;    fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);&lt;br /&gt;    fieldOIs.add(PrimitiveObjectInspectorFactory.javaIntObjectInspector);&lt;br /&gt;    return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);&lt;br /&gt;  }&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;h3&gt;The process method:&lt;/h3&gt;&lt;br /&gt;&lt;code class="prettyprint"&gt;&lt;br /&gt;  public void process(Object[] record) throws HiveException&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;p&gt;This method is where the heavy lifting occurs.  This gets called for each row of the input.  
The first task is to convert the input into a single String containing the document to process:&lt;/p&gt;&lt;br /&gt;&lt;code class="prettyprint"&gt;&lt;br /&gt;    String document = (String) stringOI.getPrimitiveJavaObject(record[0]);&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;p&gt;We can now implement our custom logic:&lt;/p&gt;&lt;br /&gt;&lt;code class="prettyprint"&gt;&lt;br /&gt;    if (document == null) {&lt;br /&gt;      return;&lt;br /&gt;    }&lt;br /&gt;    String[] tokens = document.split("\\s+");&lt;br /&gt;    for (String token : tokens) {&lt;br /&gt;      forward(new Object[] { token, Integer.valueOf(1) });&lt;br /&gt;    }&lt;br /&gt;  }&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;h3&gt;The close method:&lt;/h3&gt;&lt;br /&gt;&lt;code class="prettyprint"&gt;&lt;br /&gt;  public void close() throws HiveException { }&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;p&gt;This method allows us to do any post-processing cleanup.  Note that the output stream has already been closed at this point, so this method cannot emit more rows by calling forward.  In our case, there’s nothing to do here.&lt;/p&gt;&lt;br /&gt;&lt;h3&gt;Packaging and use:&lt;/h3&gt;&lt;br /&gt;&lt;p&gt;We deploy our TokenizeUDTF exactly like a UDF.  
We deploy the jar file to our Hive machine and enter the following in the console:&lt;/p&gt;&lt;br /&gt;&lt;code class="prettyprint"&gt;&lt;br /&gt;&amp;gt; add jar TokenizeUDTF.jar ;&lt;br /&gt;&amp;gt; create temporary function tokenize as 'com.bizo.hive.udtf.TokenizeUDTF' ;&lt;br /&gt;&amp;gt; select tokenize(contents) as (word, cnt) from docs ;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;p&gt;This gives us the intermediate mapped data, ready to be reduced by a custom UDAF.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;The code for this example is available in this &lt;a href="http://gist.github.com/499319"&gt;gist&lt;/a&gt;.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-7887618721755558062?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/7887618721755558062/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=7887618721755558062' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7887618721755558062'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7887618721755558062'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2010/07/extending-hive-with-custom-udtfs.html' title='Extending Hive with Custom UDTFs'/><author><name>Darren</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-378204440215064888</id><published>2010-06-25T09:31:00.001-07:00</published><updated>2010-06-30T10:48:55.539-07:00</updated><title 
type='text'>Come work at Bizo</title><content type='html'>Want to work on some interesting problems with a great development team?&lt;br /&gt;&lt;br /&gt;We're looking to hire a &lt;a href="http://sfbay.craigslist.org/sfc/sof/1809617881.html"&gt;junior developer&lt;/a&gt; and a &lt;a href="http://sfbay.craigslist.org/sfc/eng/1819187042.html"&gt;quantitative engineer&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Send us your resume!  Or if you know someone in the Bay Area that you think might be a good fit, please forward the posting to them.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-378204440215064888?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/378204440215064888/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=378204440215064888' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/378204440215064888'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/378204440215064888'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2010/06/come-work-at-bizo.html' title='Come work at Bizo'/><author><name>larry ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-16170570068014157</id><published>2010-06-08T11:34:00.000-07:00</published><updated>2010-10-21T23:06:07.992-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ruby'/><category 
scheme='http://www.blogger.com/atom/ns#' term='hackday'/><title type='text'>Accessing Bizo API using Ruby OAuth</title><content type='html'>During a recent HackDay, I wrote a search-engine-like frontend to our multi-dimensional business demographic database.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_hTaTFvFyFE4/TA6Iip4r2yI/AAAAAAAAAAU/qQovrCpjmPo/s1600/nok-nok.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 190px;" src="http://2.bp.blogspot.com/_hTaTFvFyFE4/TA6Iip4r2yI/AAAAAAAAAAU/qQovrCpjmPo/s400/nok-nok.jpg" alt="" id="BLOGGER_PHOTO_ID_5480467925497010978" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Among other things, the interface provides search term suggestions based on business title classification, using the &lt;a href="http://developer.bizo.com/documentation/4-classify-api"&gt;Classify&lt;/a&gt; operation of the &lt;a href="http://developer.bizo.com/"&gt;Bizo API&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I chose Ruby and the lightweight &lt;a href="http://www.sinatrarb.com/"&gt;Sinatra&lt;/a&gt; web framework for fast prototyping, and since the Bizo API uses OAuth for authentication, I reached for the excellent &lt;a href="http://oauth.rubyforge.org/"&gt;OAuth&lt;/a&gt; gem.&lt;br /&gt;&lt;br /&gt;Now, while the documentation is good, it took me a little time to grok the OAuth API and figure out how to use it.  The Bizo API does not use a RequestToken; instead we use an API key and a shared secret.  
Since the OAuth gem documentation didn't include any example for this use-case, I figured I'd post my code here as a starting point for other people to reuse.&lt;br /&gt;&lt;br /&gt;Without further ado, here's the short code fragment:&lt;br /&gt;&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;require 'rubygems'&lt;br /&gt;require 'oauth'&lt;br /&gt;require 'oauth/consumer'&lt;br /&gt;require 'json'&lt;br /&gt;&lt;br /&gt;key = 'xxxxxxxx'&lt;br /&gt;secret = 'yyyyyyyy'&lt;br /&gt;&lt;br /&gt;consumer = OAuth::Consumer.new(key, secret, {&lt;br /&gt;  :site         =&gt; "http://api.bizographics.com",&lt;br /&gt;  :scheme       =&gt; :query_string,&lt;br /&gt;  :http_method  =&gt; :get&lt;br /&gt;})&lt;br /&gt;&lt;br /&gt;title = "VP of Marketing"&lt;br /&gt;path = URI.escape("/v1/classify.json?api_key=#{key}&amp;title=#{title}")&lt;br /&gt;&lt;br /&gt;response = consumer.request(:get, path)&lt;br /&gt;&lt;br /&gt;# Display response&lt;br /&gt;p JSON.parse(response.body)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;If you're curious, here's the JSON response for the "VP of Market..." 
title classification:&lt;br /&gt;&lt;code class="prettyprint"&gt;&lt;br /&gt;{&lt;br /&gt;  "usage" =&gt; 1,&lt;br /&gt;  "bizographics" =&gt; {&lt;br /&gt;  "group" =&gt; { "name" =&gt; "High Net Worth", "code" =&gt; "high_net_worth" },&lt;br /&gt;  "functional_area" =&gt; [&lt;br /&gt;    {"name" =&gt; "Sales", "code" =&gt; "sales" },&lt;br /&gt;    {"name" =&gt; "Marketing", "code" =&gt; "marketing" }&lt;br /&gt;  ],&lt;br /&gt;  "seniority" =&gt; {"name" =&gt; "Executives", "code" =&gt; "executive" }&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Hopefully this is useful to Rubyists out there who need quick OAuth integration using HTTP GET and a query string, without going through the token exchange process.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-16170570068014157?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/16170570068014157/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=16170570068014157' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/16170570068014157'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/16170570068014157'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2010/05/accessing-bizo-api-using-ruby-oauth.html' title='Accessing Bizo API using Ruby OAuth'/><author><name>Alex Boisvert</name><uri>http://www.blogger.com/profile/05164682765137205886</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail 
xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_hTaTFvFyFE4/TA6Iip4r2yI/AAAAAAAAAAU/qQovrCpjmPo/s72-c/nok-nok.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-8544296362130877194</id><published>2010-05-11T15:49:00.001-07:00</published><updated>2010-05-11T15:49:29.358-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mongodb'/><category scheme='http://www.blogger.com/atom/ns#' term='jersey'/><category scheme='http://www.blogger.com/atom/ns#' term='scala'/><title type='text'>Hackday: dependency searching using scala, jersey, gxp, mongodb</title><content type='html'>For my hackday project, I thought I would try to build an internal tool to let us more easily search our dependency repository.  We use &lt;a href="http://ant.apache.org/ivy/"&gt;ivy&lt;/a&gt; for dependency management, and maintain our own repository in s3.  It can be kind of a pain to track down the latest version of library X, especially if you're not sure what the organization is, or maybe you know the org and not the name.  It seemed like a fun, useful project that I could tackle in a day, and that would allow me to play around with a couple of things I was interested in.  To build it, I used &lt;a href="https://jersey.dev.java.net/"&gt;jersey&lt;/a&gt;, &lt;a href="http://code.google.com/p/gxp/"&gt;gxp&lt;/a&gt;, and &lt;a href="http://www.mongodb.org/"&gt;mongoDB&lt;/a&gt;. The whole thing was written using &lt;a href="http://www.scala-lang.org/"&gt;scala&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I borrowed the main layout from the &lt;a href="http://www.springsource.com/repository/app/"&gt;SpringSource Enterprise Bundle Repository&lt;/a&gt;.  
I'm pretty happy with the results:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://com-bizo-public.s3.amazonaws.com/blog/hackday/dependency_search/dsearch.png"&gt;&lt;img src="http://com-bizo-public.s3.amazonaws.com/blog/hackday/dependency_search/dsearch.png" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;And the detail view:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://com-bizo-public.s3.amazonaws.com/blog/hackday/dependency_search/ddetail.png"&gt;&lt;img src="http://com-bizo-public.s3.amazonaws.com/blog/hackday/dependency_search/ddetail.png" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;There's also a &lt;a target="_blank" href="http://com-bizo-public.s3.amazonaws.com/blog/hackday/dependency_search/dbrowse.png"&gt;browse view&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I've been really happy using scala and jersey, and I wanted something simple and easy for this project, so I thought it was worth a shot.  After adding GXP for templating support, I have to say the combination of scala/jersey/GXP makes a pretty compelling framework for simple web apps.&lt;br /&gt;&lt;br /&gt;As an example, here's the beginning of my 'Browse' Controller:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;@Path("/b")&lt;br /&gt;class Browse {&lt;br /&gt;  val db = new RepoDB &lt;br /&gt;&lt;br /&gt;  @Path("/o")&lt;br /&gt;  @GET @Produces(Array("text/html"))  &lt;br /&gt;  def browseOrg() = browseOrgLetter("A")&lt;br /&gt;  &lt;br /&gt;  @Path("/o/{letter}")&lt;br /&gt;  @GET @Produces(Array("text/html"))&lt;br /&gt;  def browseOrgLetter(@PathParam("letter") letter : String) = {&lt;br /&gt;    &lt;br /&gt;    val orgs = db.getOrgLetters&lt;br /&gt;    &lt;br /&gt;    val results = db.findByOrgLetter(letter, 30)&lt;br /&gt;    &lt;br /&gt;    BrowseView.getGxpClosure("Organization", "o", orgs, letter, results)&lt;br /&gt;  }&lt;br /&gt;  ...&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;It's using nested paths, so /b/o is the main browse by organization page, /b/o/G would be all organizations starting with 'G'.&lt;br 
/&gt;&lt;br /&gt;Then, I have a simple MessageBodyWriter that can render a GxpClosure:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;@Provider&lt;br /&gt;@Produces(Array("text/html"))&lt;br /&gt;class GxpClosureWriter extends MessageBodyWriter[GxpClosure] {&lt;br /&gt;  val context = new GxpContext(Locale.US)&lt;br /&gt;&lt;br /&gt;  override def isWriteable(dataType: java.lang.Class[_], ...) = {&lt;br /&gt;    classOf[GxpClosure].isAssignableFrom(dataType)&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  override def writeTo(gxp: GxpClosure, ...) {&lt;br /&gt;    val out = new java.io.OutputStreamWriter(_out)&lt;br /&gt;    gxp.write(out, context)&lt;br /&gt;  }&lt;br /&gt;  ...&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;And, that's really all there is to it.  Nice, simple, and lightweight.&lt;br /&gt;&lt;br /&gt;Last but not least, &lt;a href="http://www.mongodb.org/"&gt;mongodb&lt;/a&gt;.  It was probably overkill for this project, but I was looking for an excuse to play with it some more.  I use it to store and index all of the repository information.  I have a separate crawler process that lists everything in our repository s3 bucket, then stores an entry for each artifact.  As part of this, it does some basic tokenizing of the organization and artifact names for searching.  Searching like this was a little disappointing compared to &lt;a href="http://lucene.apache.org/java/docs/"&gt;lucene&lt;/a&gt;.  Overall though, I'm pretty happy with it.  Browsing and searching are both ridiculously fast.  Like I said, it was probably overkill for the amount of data we have.... but it can never be too fast. 
Speed is most definitely a feature.&lt;br /&gt;&lt;br /&gt;Anyway, that's the wrap-up.&lt;br /&gt;&lt;br /&gt;I'd be interested to hear other thoughts/experiences on mongodb from anyone out there.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-8544296362130877194?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/8544296362130877194/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=8544296362130877194' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/8544296362130877194'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/8544296362130877194'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2010/05/hackday-dependency-searching-using.html' title='Hackday: dependency searching using scala, jersey, gxp, mongodb'/><author><name>larry ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-7613337732835165569</id><published>2010-05-05T12:06:00.000-07:00</published><updated>2010-05-07T05:31:03.086-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gslb'/><category scheme='http://www.blogger.com/atom/ns#' term='dynect'/><category scheme='http://www.blogger.com/atom/ns#' term='ec2'/><category scheme='http://www.blogger.com/atom/ns#' term='aws'/><title type='text'>Improving Global Application Performance, continued: GSLB with 
EC2</title><content type='html'>This is an unofficial continuation of Amazon's &lt;a href="http://aws.typepad.com/aws/2010/05/improving-global-application-performance.html"&gt;blog post&lt;/a&gt; on the use of Amazon CloudFront to improve application performance.&lt;br /&gt;&lt;br /&gt;CloudFront is a great CDN to consider, especially if you're already an Amazon Web Services customer.  Unfortunately, it can only be used for static content; the loading of dynamic content will still be slower for far-away users than for nearby ones.  Simply put, users in India will still see a half-second delay when loading the dynamic portions of your US-based website.  And a half-second delay has a &lt;a href="http://glinden.blogspot.com/2006/11/marissa-mayer-at-web-20.html"&gt;measurable impact on revenue&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Let's talk about speeding up dynamic content, globally.&lt;br /&gt;&lt;br /&gt;The typical EC2 implementation comprises instances deployed in a single region.  Such a deployment may span several availability zones for redundancy, but all instances are in roughly the same place, geographically.&lt;br /&gt;&lt;br /&gt;This is fine for EC2-hosted apps with nominal revenue or a highly localized user base.  But what if your users are spread around the globe?  The problem can't be solved by moving your application to another region - that would simply shift the extra latency to another group.&lt;br /&gt;&lt;br /&gt;For a distributed audience, you need a distributed infrastructure.  But you can't simply launch servers around the world and expect traffic to reach them.  
Enter Global Server Load Balancing (GSLB).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;&lt;big&gt;A primer on GSLB&lt;/big&gt;&lt;/span&gt;&lt;br /&gt;Broadly, GSLB is used to intelligently distribute traffic across multiple datacenters based on some set of rules.&lt;br /&gt;&lt;br /&gt;With GSLB, your traffic distribution can go from this:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_gVMijqvWhKs/S-HN4FrRiuI/AAAAAAAAARo/iY0WCb3vEas/s1600/map-without_gslb.gif"&gt;&lt;img style="margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 190px;" src="http://1.bp.blogspot.com/_gVMijqvWhKs/S-HN4FrRiuI/AAAAAAAAARo/iY0WCb3vEas/s400/map-without_gslb.gif" border="0" alt="" id="BLOGGER_PHOTO_ID_5467877786084543202" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;To this:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_gVMijqvWhKs/S-HN7dDWszI/AAAAAAAAARw/7TIf5k5jZxU/s1600/map-with_gslb.gif"&gt;&lt;img style="margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 190px;" src="http://3.bp.blogspot.com/_gVMijqvWhKs/S-HN7dDWszI/AAAAAAAAARw/7TIf5k5jZxU/s400/map-with_gslb.gif" border="0" alt="" id="BLOGGER_PHOTO_ID_5467877843899167538" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;GSLB can be implemented as a feature of a physical device (including certain high-end load balancers) or as a part of a DNS service.  Since we EC2 users are clearly not interested in hardware, our focus is on the latter: DNS-based GSLB.&lt;br /&gt;&lt;br /&gt;Standard DNS behavior is for an authoritative nameserver to, given queries for a certain record, always return the same result.  
A DNS-based implementation of GSLB would alter this behavior so that queries return context-dependent results.&lt;br /&gt;&lt;br /&gt;Example:&lt;br /&gt;User A queries DNS for gslb.example.com -- response: 10.1.0.1&lt;br /&gt;User B queries DNS for gslb.example.com -- response: 10.2.0.1&lt;br /&gt;&lt;br /&gt;But what context should we use?  Since our goal is to reduce wire latency, we should route users to the closest datacenter.  IP blocks can be mapped geographically -- by examining a requestor's IP address, a GSLB service can return a geo-targeted response.&lt;br /&gt;&lt;br /&gt;With geo-targeted DNS, our example would be:&lt;br /&gt;User A (in China) queries DNS for geo.example.com -- response: 10.1.0.1&lt;br /&gt;User B (in Spain) queries DNS for geo.example.com -- response: 10.2.0.1&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;&lt;big&gt;Getting started&lt;/big&gt;&lt;/span&gt;&lt;br /&gt;At a high level, implementation can be broken down into two steps:&lt;br /&gt;1) Deploying infrastructure in other AWS regions&lt;br /&gt;2) Configuring GSLB-capable DNS&lt;br /&gt;&lt;br /&gt;Infrastructure configurations will vary from shop to shop, but as an example, a read-heavy EC2 application with a single master database for writes should:&lt;br /&gt;- deploy application servers to all regions&lt;br /&gt;- deploy read-only (slave) database servers and/or read caches to all regions&lt;br /&gt;- configure application servers to use the slave database servers and/or read caches in their region for reads&lt;br /&gt;- configure application servers to use the single master in the "main" region for writes&lt;br /&gt;&lt;br /&gt;This is what such an environment would look like:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_gVMijqvWhKs/S-HJ_rIV6ZI/AAAAAAAAARQ/IlxW_cUhvno/s1600/gslb-architecture.gif"&gt;&lt;img style="margin:0px auto 10px; text-align:center;cursor:pointer; 
cursor:hand;width: 400px; height: 322px;" src="http://3.bp.blogspot.com/_gVMijqvWhKs/S-HJ_rIV6ZI/AAAAAAAAARQ/IlxW_cUhvno/s400/gslb-architecture.gif" border="0" alt="" id="BLOGGER_PHOTO_ID_5467873518351149458" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;When configuring servers to communicate across regions (app servers -&gt; master DB; slave DBs -&gt; master DB), you will need to use IP-based rules for your security groups; traffic from the "app-servers" security group you set up in eu-west-1 is indistinguishable from other traffic to your DB server in us-east-1.  This is because cross-region communication is done using external IP addresses.  Your best bet is to either automate security group updates or use Elastic IPs.&lt;br /&gt;&lt;br /&gt;Note on more complex configurations: distributed backends are hard (see &lt;a href="http://en.wikipedia.org/wiki/CAP_theorem"&gt;Brewer's [CAP] theorem&lt;/a&gt;).  Multi-region EC2 environments are much easier to implement if your application tolerates the use of 1) regional caches for reads; 2) centralized writes.  
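&lt;br /&gt;&lt;br /&gt;To make that read/write split concrete, here is a minimal Python sketch of how an app server might choose a database endpoint. The hostnames and the endpoint_for helper are made up for illustration and are not part of any AWS API; a real deployment would read these from its own configuration.

```python
# Hypothetical endpoints: one master for writes, one read replica per region.
MASTER_DB = "db-master.us-east-1.example.com"
READ_REPLICAS = {
    "us-east-1": "db-replica.us-east-1.example.com",
    "us-west-1": "db-replica.us-west-1.example.com",
    "eu-west-1": "db-replica.eu-west-1.example.com",
}


def endpoint_for(operation, local_region):
    """Return the database host an app server in local_region should use."""
    if operation == "write":
        # All writes go to the single master in the "main" region.
        return MASTER_DB
    # Reads stay in-region; fall back to the master if no replica exists there.
    return READ_REPLICAS.get(local_region, MASTER_DB)


print(endpoint_for("read", "eu-west-1"))
print(endpoint_for("write", "eu-west-1"))
```

An app server in eu-west-1 would read from its local replica and send writes across regions to the master, which is the cross-region traffic the security group note above applies to.&lt;br /&gt;&lt;br /&gt;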
If you have a choice, stick with the simpler route.&lt;br /&gt;&lt;br /&gt;As for configuring DNS, several companies have DNS-based GSLB service offerings:&lt;br /&gt;- &lt;a href="http://dyn.com/dynect"&gt;Dynect&lt;/a&gt; - &lt;a href="http://dyn.com/dynect-traffic-management"&gt;Traffic Management&lt;/a&gt; (A records only) and &lt;a href="http://dyn.com/dynect-cdn-manager"&gt;CDN Manager&lt;/a&gt; (CNAMEs allowed)&lt;br /&gt;- &lt;a href="http://www.akamai.com/"&gt;Akamai&lt;/a&gt; - &lt;a href="http://www.akamai.com/html/technology/products/gtm.html"&gt;Global Traffic Management&lt;/a&gt;&lt;br /&gt;- &lt;a href="http://www.ultradns.com/"&gt;UltraDNS&lt;/a&gt; - &lt;a href="http://www.ultradns.com/solutions/directionaldns.html"&gt;Directional DNS&lt;/a&gt;&lt;br /&gt;- &lt;a href="http://comwired.com/"&gt;Comwired&lt;/a&gt;/&lt;a href="http://www.dns.com/"&gt;DNS.com&lt;/a&gt; - &lt;a href="http://www.dns.com/location/"&gt;Location Geo-Targeting&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;DNS configuration should be pretty similar for the vendors listed above.  Basic steps are:&lt;br /&gt;1) set up regional CNAMEs (us-east-1.example.com, us-west-1.example.com, eu-west-1.example.com, ap-southeast-1.example.com)&lt;br /&gt;2) set up a GSLB-enabled "master" CNAME (www.example.com)&lt;br /&gt;3) define the GSLB rules:&lt;br /&gt;  - For users in Asia, return ap-southeast-1.example.com&lt;br /&gt;  - For users in Europe, return eu-west-1.example.com&lt;br /&gt;  - For users in Western US, return us-west-1.example.com&lt;br /&gt;  - ...&lt;br /&gt;  - For all other users, return us-east-1.example.com&lt;br /&gt;&lt;br /&gt;If your application is already live, consider abstracting the DNS records by one layer: geo.example.com (master record); us-east-1.geo.example.com, us-west-1.geo.example.com, etc. (regional records).  
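&lt;br /&gt;&lt;br /&gt;The rules in step 3 amount to a lookup table with a default. Here is a minimal Python sketch using the geo.example.com record layout described above. The area labels and the resolve_cname helper are hypothetical; a real GSLB service derives the area from the requestor's IP address and is configured through the vendor's own interface, not through code like this.

```python
GEO_ZONE = "geo.example.com"  # master record from the layout above

# Hypothetical area labels mapped to AWS regions, mirroring the step 3 rules.
RULES = {
    "asia": "ap-southeast-1",
    "europe": "eu-west-1",
    "us-west": "us-west-1",
}
DEFAULT_REGION = "us-east-1"  # the "all other users" rule


def resolve_cname(user_area):
    """Return the regional record a GSLB rule set would answer with."""
    region = RULES.get(user_area, DEFAULT_REGION)
    return region + "." + GEO_ZONE


print(resolve_cname("europe"))
print(resolve_cname("south-america"))
```

Any area without an explicit rule falls through to the us-east-1 record, matching the catch-all rule in the list.&lt;br /&gt;&lt;br /&gt;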
Bring the new configuration live by pointing www.example.com (CNAME) to geo.example.com.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;&lt;big&gt;Bizo's experiences&lt;/big&gt;&lt;/span&gt;&lt;br /&gt;Several of our EC2 applications serve embedded content for customer websites, so it's critical we minimize load times.  Here's the difference we saw on one app after expanding into new regions (from us-east-1 alone to us-east-1, us-west-1, and eu-west-1) and implementing GSLB (load times provided by &lt;a href="http://browsermob.com/"&gt;BrowserMob&lt;/a&gt;):&lt;br /&gt;&lt;br /&gt;Load times before GSLB:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_gVMijqvWhKs/S-HlxnLyBDI/AAAAAAAAAR4/-bln2h6l8DA/s1600/gslb-response-after.png"&gt;&lt;img style="margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 207px;" src="http://4.bp.blogspot.com/_gVMijqvWhKs/S-HlxnLyBDI/AAAAAAAAAR4/-bln2h6l8DA/s400/gslb-response-after.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5467904063099241522" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Load times after GSLB:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_gVMijqvWhKs/S-Hl4pzvbDI/AAAAAAAAASA/Q5UDNHfsCt0/s1600/gslb-response-before.png"&gt;&lt;img style="margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 209px;" src="http://1.bp.blogspot.com/_gVMijqvWhKs/S-Hl4pzvbDI/AAAAAAAAASA/Q5UDNHfsCt0/s400/gslb-response-before.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5467904184062798898" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Reduced load times for everyone far from us-east-1.  Users are happy, customers are happy, we're happy.  Overall, a success.&lt;br /&gt;&lt;br /&gt;It's interesting to see how the load is distributed throughout the day.  
Here's one application's HTTP traffic, broken down by region (ELB stats graphed by &lt;a href="http://code.google.com/p/cloudviz/"&gt;cloudviz&lt;/a&gt;):&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_gVMijqvWhKs/S-HmVJqLWBI/AAAAAAAAASI/cctK23Kh4qA/s1600/gslb-stats.png"&gt;&lt;img style="margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 104px;" src="http://3.bp.blogspot.com/_gVMijqvWhKs/S-HmVJqLWBI/AAAAAAAAASI/cctK23Kh4qA/s400/gslb-stats.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5467904673648957458" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Note that the use of Elastic Load Balancers and Auto Scaling becomes much more compelling with GSLB.  Because GSLB partitions users geographically, peak hours are much more localized.  This results in a wider difference between peak and trough demand per region; Auto Scaling adjusts capacity transparently, reducing the marginal cost of expanding your infrastructure to multiple AWS regions.&lt;br /&gt;&lt;br /&gt;For our GSLB DNS service, we use Dynect and couldn't be more pleased.  Intuitive management interface, responsive and helpful support, friendly, no-BS sales.  Pricing is based on the number of GSLB-enabled domains and the DNS query rate.  Contact &lt;a href="http://dyn.com/dynectsales"&gt;Dynect sales&lt;/a&gt; if you want specifics (we work with &lt;a href="http://twitter.com/jadelisle"&gt;Josh Delisle&lt;/a&gt; and &lt;a href="http://twitter.com/kyork20"&gt;Kyle York&lt;/a&gt; - great guys).  Note that those intending to use GSLB with Elastic Load Balancers will need the CDN Manager service, since ELBs are addressed by hostname (CNAME) rather than by IP address.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;&lt;big&gt;Closing remarks&lt;/big&gt;&lt;/span&gt;&lt;br /&gt;Previously, operating a global infrastructure required significant overhead.  This is where AWS really shines.  
Amazon now has four regions spread across three continents, and there's minimal overhead to distribute your platform across all of them.  You just need to add a layer to route users to the closest one.&lt;br /&gt;&lt;br /&gt;The use of Amazon CloudFront in conjunction with a global EC2 infrastructure is a killer combo for improving application performance.  And with Amazon continually expanding with new AWS regions, it's only going to get better.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://twitter.com/mikebabineau"&gt;@mikebabineau&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-7613337732835165569?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/7613337732835165569/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=7613337732835165569' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7613337732835165569'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7613337732835165569'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2010/05/improving-global-application.html' title='Improving Global Application Performance, continued: GSLB with EC2'/><author><name>Mike Babineau</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_gVMijqvWhKs/S5AaRFCb15I/AAAAAAAAAN8/L3FurYxZG7A/S220/mike1.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_gVMijqvWhKs/S-HN4FrRiuI/AAAAAAAAARo/iY0WCb3vEas/s72-c/map-without_gslb.gif' height='72' 
width='72'/><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-8776843409519349473</id><published>2010-04-02T19:22:00.000-07:00</published><updated>2010-04-02T19:32:17.799-07:00</updated><title type='text'>GWT MVP Tables</title><content type='html'>I'm building a GWT MVP app here at Bizo. Despite its upsides (primarily insanely fast unit tests), GWT MVP is new, underdocumented, and, for me anyway, has had a non-trivial learning curve.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So, in the interest of sharing tricks I've learned, one I've come across is how to elegantly implement dashboard-style tables.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Dashboard-style tables are those where each row has a lot of interactivity, e.g. a user can start/stop/cancel the entity in the row, see the icons change, and do various other UI things that you generally want to have under test in presenters.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For a while, I was stumped because most GWT MVP tables are data-style tables, where the focus is more on efficiently displaying 100s/1000s of rows of basically non-interactive data. There are a few counterexamples, but none I had seen got to the point of pushing testable HasXxx interfaces back into a presenter for business logic to hook up to.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I finally arrived at a suitable solution by realizing that each row is essentially its own view, and then instantiating a new RowXxxPresenter for each domain object in the value. This means there is a RowXxxView, which can expose all the HasXxx interfaces it wants to the presenter. 
And the RowXxxPresenterTest becomes a treat to write.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For more details and code examples, please &lt;a href="http://www.draconianoverlord.com/2010/03/31/gwt-mvp-tables.html"&gt;jump over&lt;/a&gt; to the more elaborate post over on my personal blog.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-8776843409519349473?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/8776843409519349473/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=8776843409519349473' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/8776843409519349473'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/8776843409519349473'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2010/04/gwt-mvp-tables.html' title='GWT MVP Tables'/><author><name>Stephen Haberman</name><uri>http://www.blogger.com/profile/05412274950722949930</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-2636769230713215111</id><published>2010-03-24T10:10:00.001-07:00</published><updated>2010-03-24T10:13:13.811-07:00</updated><title type='text'>Bizo Job - Designer</title><content type='html'>&lt;p class="western" style="font-family:Arial;margin-left:6in;text-align:left"&gt;&lt;/p&gt;&lt;p class="western"  
style="font-family:Arial;"&gt;&lt;/p&gt;&lt;div id="co6v" style="font-family:Arial;text-align:left"&gt;&lt;img src="http://docs.google.com/a/bizo.com/File?id=dgbsf555_11frz9rwhn_b" style="height:37px;width:75px" /&gt;&lt;br /&gt;&lt;/div&gt;&lt;p class="western" face="Arial"&gt;&lt;br /&gt;&lt;/p&gt;&lt;p class="western" face="Arial"&gt;&lt;span style="font-family:arial;"&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Position: &lt;/span&gt;&lt;/b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Web / UI / UX Designer (San Francisco)&lt;/span&gt;&lt;/span&gt; &lt;/p&gt;&lt;p class="western" face="Arial"&gt;&lt;br /&gt;&lt;/p&gt;&lt;p class="western" style="font-family:Arial"&gt;&lt;span style="font-size:85%;"&gt;&lt;span style="font-size:85%;"&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Summary:&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color:#000000;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;We’re looking for an out-of-the-box thinker with a good sense-of-humor and a great attitude to join our product development team. As the first in-house Web / UI / UX Designer for Bizo, you will take responsibility for developing easy, powerful, consistent and high velocity web and interaction designs across all Bizo web products as well as marketing materials related to the Bizographic Targeting Platform, a revolutionary new way to target business advertising online. You will be a key player on an incredible team as we build our world-beating, game-changing, and massively scalable bizographic advertising and targeting platform. 
&lt;/span&gt;&lt;u&gt;&lt;i&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;In a nutshell, you will be the voice of reason in all design and usability aspects of Bizo.&lt;/span&gt;&lt;/i&gt;&lt;/u&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;/p&gt;&lt;p class="western" style="font-family:Arial"&gt;&lt;span style="font-size:85%;"&gt;&lt;span style="color:#000000;"&gt;&lt;span style="font-size:85%;"&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;The Team:&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;We’re a small team of very talented people (if we do say so ourselves!).  We use Agile development methodologies. We care about high quality results, not how many hours you’re in the office. We believe in strong design that helps people g&lt;/span&gt;&lt;u&gt;&lt;i&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;et stuff done&lt;/span&gt;&lt;/i&gt;&lt;/u&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;!&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;The Ideal Candidate:&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;ul face="Arial"&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Self-motivated&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Entrepreneurial / Hacker spirit&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Experience/Expertise with Adobe Illustrator (and/or similar design tools)&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Strong CSS skills&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Strong HTML skills &lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-size: 
medium;"&gt;Strong Javascript skills&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Understands the value of mock-ups (points for Balsamiq experience)&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Flash experience (bonus but not required)&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Enjoys working on teams&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Educational background or industry experience in Design or related field – points for advanced degrees&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;&lt;u&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Gets stuff done!&lt;/span&gt;&lt;/u&gt;&lt;/i&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p class="western" style="font-family:Arial;margin-left:0.75in"&gt;&lt;br /&gt;&lt;/p&gt;&lt;p class="western" style="font-family:Arial"&gt;&lt;span style="font-size:85%;"&gt;&lt;span style="font-family:arial;"&gt;&lt;b&gt;Please send a resume, cover letter and link to online portfolio to: &lt;a href="mailto:donnie@bizo.com" id="dbdw" title="donnie@bizo.com"&gt;donnie@bizo.com&lt;/a&gt;&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-2636769230713215111?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/2636769230713215111/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=2636769230713215111' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/5261056907132640554/posts/default/2636769230713215111'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/2636769230713215111'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2010/03/bizo-job-description-designer.html' title='Bizo Job - Designer'/><author><name>Donnie</name><uri>http://www.blogger.com/profile/13599133732419522440</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-6092312981731082407</id><published>2010-03-17T13:31:00.001-07:00</published><updated>2010-03-17T13:31:26.957-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='cloudwatch'/><category scheme='http://www.blogger.com/atom/ns#' term='Google Visualizations'/><category scheme='http://www.blogger.com/atom/ns#' term='cloudviz'/><category scheme='http://www.blogger.com/atom/ns#' term='boto'/><category scheme='http://www.blogger.com/atom/ns#' term='aws'/><title type='text'>Introducing Cloudviz</title><content type='html'>&lt;div&gt;Amazon CloudWatch exposes a variety of useful metrics for EC2 instances, Elastic Load Balancers, and more.  Unfortunately, it is tedious to query directly and the results can be difficult to interpret.&lt;/div&gt;&lt;br /&gt;&lt;div&gt;Like most operational metrics, CloudWatch data provides the most insight when graphed.  While there are existing tools to graph CloudWatch data, they are only available as part of a proprietary suite or service and, generally, they sacrifice customization and flexibility for ease-of-use.&lt;/div&gt;&lt;br /&gt;&lt;div&gt;Here at Bizo, we wanted to incorporate CloudWatch data into operational dashboards.  
Nothing we found was flexible enough to meet our needs, so we decided to write our own.  We are now releasing it for all to use.&lt;/div&gt;&lt;br /&gt;&lt;div&gt;I'm pleased to introduce &lt;a href="http://github.com/mbabineau/cloudviz"&gt;cloudviz&lt;/a&gt;, an open source tool for creating embeddable CloudWatch graphs.&lt;/div&gt;&lt;br /&gt;&lt;div&gt;Specifically, cloudviz is a data source that exposes CloudWatch data for graphing by Google &lt;a href="http://code.google.com/apis/visualization/interactive_charts.html"&gt;Interactive Charts&lt;/a&gt; (formerly Visualization API).  It's written in Python using Google's &lt;a href="http://code.google.com/apis/visualization/documentation/dev/gviz_api_lib.html"&gt;Data Source library&lt;/a&gt; and Mitch Garnaat's excellent AWS interface, &lt;a href="http://code.google.com/p/boto"&gt;boto&lt;/a&gt;.&lt;/div&gt;&lt;br /&gt;&lt;div&gt;With cloudviz, it's easy to create graphs like these:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://mbabineau.github.com/cloudviz/example-elb-requestcount.png"&gt;&lt;img src="http://mbabineau.github.com/cloudviz/example-elb-requestcount.png" style="text-align: left;display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; cursor: pointer; width: 400px;" border="0" alt="" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;a href="http://mbabineau.github.com/cloudviz/example-hosts-cpu.png"&gt;&lt;img src="http://mbabineau.github.com/cloudviz/example-hosts-cpu.png" style="text-align: left;display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; cursor: pointer; width: 400px;" border="0" alt="" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;div&gt;I encourage you to check out the project on GitHub &lt;a href="http://github.com/mbabineau/cloudviz"&gt;here&lt;/a&gt;.  
There's a fairly detailed README and plenty of examples, but feel free to drop me a line if you have any questions, &lt;a href="mailto:mike@bizo.com"&gt;mike@bizo.com&lt;/a&gt;.&lt;/div&gt;&lt;br /&gt;&lt;div&gt;Happy graphing!&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-6092312981731082407?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/6092312981731082407/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=6092312981731082407' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/6092312981731082407'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/6092312981731082407'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2010/03/introducing-cloudviz.html' title='Introducing Cloudviz'/><author><name>Mike Babineau</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_gVMijqvWhKs/S5AaRFCb15I/AAAAAAAAAN8/L3FurYxZG7A/S220/mike1.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-7112311583419550759</id><published>2010-03-05T11:01:00.000-08:00</published><updated>2010-03-05T11:14:45.714-08:00</updated><title type='text'>SSH to EC2 instance ID</title><content type='html'>&lt;span class="Apple-style-span"   style="  color: rgb(68, 68, 68); line-height: 18px; font-family:Arial, Helvetica, sans-serif;font-size:12px;"&gt;&lt;p style="margin-top: 15px; margin-right: 0px; margin-bottom: 15px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 
0px; line-height: 1.5em; "&gt;I often find myself looking up EC2 nodes by instance ID so I can grab the external DNS name and SSH in. Fed up with the extra “ec2-describe-instances, copy, paste” layer, I threw together a function (basically a fancy alias) to SSH into an EC2 instance referenced by ID.&lt;/p&gt;&lt;p style="margin-top: 15px; margin-right: 0px; margin-bottom: 15px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; line-height: 1.5em; "&gt;Assuming you’re on Mac OS X / Linux, just put &lt;a href="http://gist.github.com/319882#file_ssh_instance_function.sh"&gt;this&lt;/a&gt; somewhere in ~/.profile, reload your terminal, and you’re good to go.&lt;/p&gt;&lt;p style="margin-top: 15px; margin-right: 0px; margin-bottom: 15px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; line-height: 1.5em; "&gt;Alternatively, you can use the &lt;a href="http://gist.github.com/319882#file_ssh_instance.sh"&gt;shell script version&lt;/a&gt;.&lt;/p&gt;&lt;br /&gt;&lt;script src="http://gist.github.com/319882.js?file=ssh-instance-function.sh"&gt;&lt;/script&gt;&lt;p&gt;&lt;/p&gt;&lt;p style="margin-top: 15px; margin-right: 0px; margin-bottom: 15px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; "&gt;(note: cross-posted &lt;a href="http://mikebabineau.posterous.com/ssh-to-ec2-instance-id"&gt;here&lt;/a&gt;)&lt;/p&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-7112311583419550759?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/7112311583419550759/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=7112311583419550759' title='0 Comments'/><link 
rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7112311583419550759'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7112311583419550759'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2010/03/ssh-to-ec2-instance-id.html' title='SSH to EC2 instance ID'/><author><name>Mike Babineau</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_gVMijqvWhKs/S5AaRFCb15I/AAAAAAAAAN8/L3FurYxZG7A/S220/mike1.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-4205551558860765259</id><published>2010-03-04T11:45:00.000-08:00</published><updated>2010-03-04T12:35:47.841-08:00</updated><title type='text'>Example git/git-sh config</title><content type='html'>I've been using git, git-svn, and git-sh while working on Bizo's internal projects and really enjoying it. Per requests from some other devs, here is my git/git-sh config.&lt;br /&gt;&lt;br /&gt;First, you should start with &lt;a href="http://github.com/rtomayko/git-sh"&gt;git-sh&lt;/a&gt;. It adds some bash shell customizations like a nice `PS1` prompt, tab completion, and incredibly short git-specific aliases. I'll cover some of the aliases later, but this is the thing that started me down the "how cool can I get my git environment" path.&lt;br /&gt;&lt;br /&gt;&lt;div&gt;I've included commented versions of my .gitconfig and .gitshrc below, but you can find raw versions &lt;a href="http://www.draconianoverlord.com/files/gitconfig"&gt;here&lt;/a&gt; and &lt;a href="http://www.draconianoverlord.com/files/gitshrc"&gt;here&lt;/a&gt;. 
I also cross-posted this on my &lt;a href="http://www.draconianoverlord.com/2010/03/04/git-config.html"&gt;personal&lt;/a&gt; blog if you're so inclined as to read it twice.&lt;br /&gt;&lt;h3&gt;Example Shell Session&lt;/h3&gt;&lt;br /&gt;&lt;br /&gt;A lot of my customizations are around aliases, so this is a quick overview, and then the aliases are defined/explained below.&lt;br /&gt;&lt;br /&gt;Here is a made-up example bash session with some of the commands:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;code&gt;&lt;br /&gt;  # show we're in a basic java/whatever project&lt;br /&gt;  $ ls&lt;br /&gt;  src/ tests/&lt;br /&gt;&lt;br /&gt;  # start git-sh to get into a git-specific bash environment&lt;br /&gt;  $ git sh&lt;br /&gt;&lt;br /&gt;  # change some things&lt;br /&gt;  $ echo "file1" &gt; src/package1/file1&lt;br /&gt;  $ echo "file2" &gt; src/package2/file2&lt;br /&gt;  $ echo "file3" &gt; src/package3/file3&lt;br /&gt;&lt;br /&gt;  # see all of our changes&lt;br /&gt;  $ d&lt;br /&gt;  # runs: git diff&lt;br /&gt;&lt;br /&gt;  # see only the changes in package1&lt;br /&gt;  $ dg package1&lt;br /&gt;  # runs: git diff src/package1/file1&lt;br /&gt;&lt;br /&gt;  # stage any path with 'package' in it&lt;br /&gt;  $ ag package&lt;br /&gt;  # runs: git add src/package1/file1 src/package2/file2 src/package3/file3&lt;br /&gt;&lt;br /&gt;  # we only wanted package1, reset package2 and package3&lt;br /&gt;  $ rsg package2&lt;br /&gt;  # runs: git reset src/package2/file2&lt;br /&gt;  $ rsg package3&lt;br /&gt;  # runs: git reset src/package3/file3&lt;br /&gt;&lt;br /&gt;  # see what we have staged now (only package1)&lt;br /&gt;  $ p&lt;br /&gt;  # runs: git diff --cached&lt;br /&gt;&lt;br /&gt;  # commit it&lt;br /&gt;  $ commit -m "Changed stuff in package1"&lt;br /&gt;  # runs: git commit -m "..."&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;That is the basic idea.&lt;br /&gt;&lt;br /&gt;Most of the magic is from the [alias] section of .gitconfig, along 
with my .gitshrc allowing the git prefix to be dropped.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;.gitconfig&lt;/h3&gt;&lt;br /&gt;&lt;br /&gt;The .gitconfig file is in your home directory and is for user-wide settings.&lt;br /&gt;&lt;br /&gt;Here is my current .gitconfig with comments:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;code&gt;&lt;br /&gt;  [user]&lt;br /&gt;    name = Stephen Haberman&lt;br /&gt;    email = stephen@exigencecorp.com&lt;br /&gt;  [alias]&lt;br /&gt;    # 'add all' stages all new+changed+deleted files&lt;br /&gt;    aa = !git ls-files -d | xargs -r git rm &amp;amp;&amp;amp; git ls-files -m -o --exclude-standard | xargs -r git add&lt;br /&gt;&lt;br /&gt;    # 'add grep' stages all new+changed that match $1&lt;br /&gt;    ag = "!sh -c 'git ls-files -m -o --exclude-standard | grep $1 | xargs -r git add' -"&lt;br /&gt;&lt;br /&gt;    # 'checkout grep' checkouts any files that match $1&lt;br /&gt;    cg = "!sh -c 'git ls-files -m | grep $1 | xargs -r git checkout' -"&lt;br /&gt;&lt;br /&gt;    # 'diff grep' diffs any files that match $1&lt;br /&gt;    dg = "!sh -c 'git ls-files -m | grep $1 | xargs -r git diff' -"&lt;br /&gt;&lt;br /&gt;    # 'patch grep' diff --cached any files that match $1&lt;br /&gt;    pg = "!sh -c 'git ls-files -m | grep $1 | xargs -r git diff --cached' -"&lt;br /&gt;&lt;br /&gt;    # 'remove grep' remove any files that match $1&lt;br /&gt;    rmg = "!sh -c 'git ls-files -d | grep $1 | xargs -r git rm' -"&lt;br /&gt;&lt;br /&gt;    # 'reset grep' reset any files that match $1&lt;br /&gt;    rsg = "!sh -c 'git ls-files -c | grep $1 | xargs -r git reset' -"&lt;br /&gt;&lt;br /&gt;    # nice log output&lt;br /&gt;    lg = log --graph --pretty=oneline --abbrev-commit --decorate&lt;br /&gt;&lt;br /&gt;    # rerun svn show-ignore -&gt; exclude&lt;br /&gt;    si = !git svn show-ignore &gt; .git/info/exclude&lt;br /&gt;&lt;br /&gt;    # start git-sh&lt;br /&gt;    sh = !git-sh&lt;br /&gt;  [color]&lt;br /&gt;    # turn on color&lt;br /&gt;    
diff = auto&lt;br /&gt;    status = auto&lt;br /&gt;    branch = auto&lt;br /&gt;    interactive = auto&lt;br /&gt;    ui = auto&lt;br /&gt;  [color "branch"]&lt;br /&gt;    # good looking colors i copy/pasted from somewhere&lt;br /&gt;    current = green bold&lt;br /&gt;    local = green&lt;br /&gt;    remote = red bold&lt;br /&gt;  [color "diff"]&lt;br /&gt;    # good looking colors i copy/pasted from somewhere&lt;br /&gt;    meta = yellow bold&lt;br /&gt;    frag = magenta bold&lt;br /&gt;    old = red bold&lt;br /&gt;    new = green bold&lt;br /&gt;  [color "status"]&lt;br /&gt;    # good looking colors i copy/pasted from somewhere&lt;br /&gt;    added = green bold&lt;br /&gt;    changed = yellow bold&lt;br /&gt;    untracked = red&lt;br /&gt;  [color "sh"]&lt;br /&gt;    branch = yellow&lt;br /&gt;  [core]&lt;br /&gt;    excludesfile = /home/stephen/.gitignore&lt;br /&gt;    # two-space tabs&lt;br /&gt;    pager = less -FXRS -x2&lt;br /&gt;  [push]&lt;br /&gt;    # 'git push' should only do the current branch, not all&lt;br /&gt;    default = current&lt;br /&gt;  [branch]&lt;br /&gt;    # always setup 'git pull' to rebase instead of merge&lt;br /&gt;    autosetuprebase = always&lt;br /&gt;  [diff]&lt;br /&gt;    renames = copies&lt;br /&gt;    mnemonicprefix = true&lt;br /&gt;  [svn]&lt;br /&gt;    # push empty directory removals back to svn at directory deletes&lt;br /&gt;    rmdir = true&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;.gitshrc&lt;/h3&gt;&lt;br /&gt;&lt;br /&gt;This is my .gitshrc file, heavily based off Ryan Tomayko's original.&lt;br /&gt;&lt;br /&gt;Ryan's original comments are prefixed with #, I'll prefix my additions with ###, most of which are aliases to my [alias] entries above and some git-svn aliases.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;code&gt;&lt;br /&gt;  #!/bin/bash&lt;br /&gt;  # rtomayko's ~/.gitshrc file&lt;br /&gt;  ### With additions from stephenh&lt;br /&gt;&lt;br /&gt;  # git commit&lt;br /&gt;  
gitalias commit='git commit --verbose'&lt;br /&gt;  gitalias amend='git commit --verbose --amend'&lt;br /&gt;  gitalias ci='git commit --verbose'&lt;br /&gt;  gitalias ca='git commit --verbose --all'&lt;br /&gt;  gitalias  n='git commit --verbose --amend'&lt;br /&gt;&lt;br /&gt;  # git branch and remote&lt;br /&gt;  gitalias  b='git branch -av' ### Added -av parameter&lt;br /&gt;  gitalias rv='git remote -v'&lt;br /&gt;&lt;br /&gt;  # git add&lt;br /&gt;  gitalias  a='git add'&lt;br /&gt;  gitalias au='git add --update'&lt;br /&gt;  gitalias ap='git add --patch'&lt;br /&gt;  ### Added entries for my .gitconfig aliases&lt;br /&gt;  alias aa='git aa' # add all updated/new/deleted&lt;br /&gt;  alias ag='git ag' # add with grep&lt;br /&gt;  alias agp='git agp' # add with grep -p&lt;br /&gt;  alias cg='git cg' # checkout with grep&lt;br /&gt;  alias dg='git dg' # diff with grep&lt;br /&gt;  alias pg='git pg' # patch with grep&lt;br /&gt;  alias rsg='git rsg' # reset with grep&lt;br /&gt;  alias rmg='git rmg' # remove with grep&lt;br /&gt;&lt;br /&gt;  # git checkout&lt;br /&gt;  gitalias c='git checkout'&lt;br /&gt;&lt;br /&gt;  # git fetch&lt;br /&gt;  gitalias f='git fetch'&lt;br /&gt;&lt;br /&gt;  # basic interactive rebase of last 10 commits&lt;br /&gt;  gitalias r='git rebase --interactive HEAD~10'&lt;br /&gt;  alias cont='git rebase --continue'&lt;br /&gt;&lt;br /&gt;  # git diff&lt;br /&gt;  gitalias d='git diff'&lt;br /&gt;  gitalias p='git diff --cached'   # mnemonic: "patch"&lt;br /&gt;&lt;br /&gt;  # git ls-files&lt;br /&gt;  ### Added o to list other files that aren't ignored&lt;br /&gt;  gitalias o='git ls-files -o --exclude-standard'    # "other"&lt;br /&gt;&lt;br /&gt;  # git status&lt;br /&gt;  alias  s='git status'&lt;br /&gt;&lt;br /&gt;  # git log&lt;br /&gt;  gitalias  L='git log'&lt;br /&gt;  # gitalias l='git log --graph --pretty=oneline --abbrev-commit --decorate'&lt;br /&gt;  gitalias  l="git log --graph --pretty=format:'%Cred%h%Creset 
-%C(yellow)%d%Creset %s %Cgreen(%cr)%Creset' --abbrev-commit --date=relative"&lt;br /&gt;  gitalias ll='git log --pretty=oneline --abbrev-commit --max-count=15'&lt;br /&gt;&lt;br /&gt;  # misc&lt;br /&gt;  gitalias pick='git cherry-pick'&lt;br /&gt;&lt;br /&gt;  # experimental&lt;br /&gt;  gitalias mirror='git reset --hard'&lt;br /&gt;  gitalias stage='git add'&lt;br /&gt;  gitalias unstage='git reset HEAD'&lt;br /&gt;  gitalias pop='git reset --soft HEAD^'&lt;br /&gt;  gitalias review='git log -p --max-count=1'&lt;br /&gt;&lt;br /&gt;  ### Added git svn aliases&lt;br /&gt;  gitalias si='git si' # update svn ignore &gt; exclude&lt;br /&gt;  gitalias sr='git svn rebase'&lt;br /&gt;  gitalias sp='git svn dcommit'&lt;br /&gt;  gitalias sf='git svn fetch'&lt;br /&gt;&lt;br /&gt;  ### Added call to git-wtf tool&lt;br /&gt;  gitalias wtf='git-wtf'&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Since I defined most of the interesting aliases in the .gitconfig [alias] section, it means they're all usable via git xxx, e.g. 
git ag foo, but listing alias ag='git ag' in .gitshrc means you can also just use ag foo, assuming you've started the git-sh environment.&lt;br /&gt;&lt;br /&gt;It results in some duplication, but means they're usable from both inside and outside of git-sh, which I think is useful.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-4205551558860765259?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/4205551558860765259/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=4205551558860765259' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/4205551558860765259'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/4205551558860765259'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2010/03/example-gitgit-sh-config.html' title='Example git/git-sh config'/><author><name>Stephen Haberman</name><uri>http://www.blogger.com/profile/05412274950722949930</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-2659989623579807466</id><published>2010-03-04T01:44:00.000-08:00</published><updated>2010-03-04T02:04:14.503-08:00</updated><title type='text'>Get Your Speed Tracer On!</title><content type='html'>I first saw Speed Tracer in action at Google I/O 2009 and was pretty amped about it.  While we have been using GWT 2.0 features for a few months now (e.g. 
OOPHM, UiBinder, ClientBundle), I had not tried out Speed Tracer until tonight.  Speed Tracer is a Chrome plugin that is a web performance profiling tool on steroids.  The level of profiling information that you can get from this tool is truly amazing.  If you develop web apps then I highly recommend that you check it out.  I guarantee it will be something you will want to have in your toolbox.  Installation instructions can be found &lt;a href="http://code.google.com/webtoolkit/speedtracer/get-started.html"&gt;here&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-2659989623579807466?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/2659989623579807466/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=2659989623579807466' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/2659989623579807466'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/2659989623579807466'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2010/03/get-your-speed-tracer-on.html' title='Get Your Speed Tracer On!'/><author><name>Timo</name><uri>http://www.blogger.com/profile/05949421779840031276</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-2778603528924817451</id><published>2010-02-19T11:40:00.000-08:00</published><updated>2010-02-19T12:53:32.240-08:00</updated><title type='text'>Triggering post-Elastic MapReduce steps as 
parameterized jobs in Hudson</title><content type='html'>Here at Bizo, the combination of &lt;a href="http://dev.bizo.com/2009/11/using-hudson-to-manage-crons.html"&gt;Hudson for cron management&lt;/a&gt;, &lt;a href="http://dev.bizo.com/search?q=hive"&gt;Hive&lt;/a&gt; for report generation, and Elastic MapReduce for provisioning compute power has greatly simplified our data processing.  Periodically and automatically, our Hudson cron instance generates Hive scripts for us and launches them in EC2.&lt;br /&gt;&lt;br /&gt;The main inconvenience with this process is that the results of our Hive jobs are left as one or more obscurely named files in S3.  These often need some post-processing to put them into a more friendly form.  Unfortunately, EMR doesn't have an easy hook for launching these post-processing tasks -- although we could implement them as MapReduce steps, we'd need to write our own workflows, losing the simplicity of using EMR's simple "--hive-script" flag.&lt;br /&gt;&lt;br /&gt;Our solution is to use SimpleDB to store some basic metadata about jobs.  Using this metadata, a Hudson job periodically checks the EMR API to determine whether tasks have completed.  If so, it then triggers other Hudson jobs that are responsible for processing the results.&lt;br /&gt;&lt;br /&gt;Here are some tools that make this process work:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://gist.github.com/309159"&gt;Simple script to put data into SimpleDB.&lt;/a&gt;  Our metadata scheme is to use the jobflow ID as item names and the name/parameters of the jobs to trigger as attributes.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://wiki.hudson-ci.org/display/HUDSON/Parameterized+Build"&gt;The Hudson parameterized build feature.&lt;/a&gt;  It's not really feasible to create a new Hudson job for each individual report that runs, so we pass parameters to Hudson so the post-processing step can figure out where the results are in S3 and what to do with them.  
It's not well-documented how to do this programmatically (as opposed to from the web interface); the solution is to send some JSON to the build URL.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://gist.github.com/309173"&gt;The Trigger Script.&lt;/a&gt;  This is the script that periodically runs on our cron server to check if a post-processing step should be triggered.  The JSON format for parameterized jobs is described in the comments of this file.&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;The end result is that a job can run an EMR job and configure a post-processing step for itself with the following commands:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;JOB_ID=`elastic-mapreduce --create --hive-script --arg ${s3.location} | grep "Created job flow" | awk '{ print $4 }' -`&lt;br /&gt;&lt;br /&gt;simpledb-put.rb -d ${metadata.domain} -i $JOB_ID "next_on_cron_server_job_name=post-processing-step" "next_on_cron_server_job_params={\"parameter\": [{\"name\":\"PARAM1\", \"value\":\"VALUE1\" }]}" "next_on_cron_server_triggered=false"&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;This launches the hive script in the specified s3.location and configures the "post-processing-step" job on the cron server to run with the parameter "PARAM1=VALUE1".&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-2778603528924817451?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/2778603528924817451/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=2778603528924817451' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/2778603528924817451'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/5261056907132640554/posts/default/2778603528924817451'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2010/02/triggering-post-elastic-mapreduce-steps.html' title='Triggering post-Elastic MapReduce steps as parameterized jobs in Hudson'/><author><name>Darren</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-6743434894901623948</id><published>2010-01-07T19:42:00.000-08:00</published><updated>2010-01-07T20:39:24.973-08:00</updated><title type='text'>Scala Supports Non-Local Returns</title><content type='html'>Writing some Scala code today, I found myself using non-local returns without even thinking about it. After realizing what I had done, I dug a little deeper to see what was really going on.&lt;br /&gt;&lt;br /&gt;Take this completely made up, nonsensical example:&lt;br /&gt;&lt;br /&gt;&lt;code class="prettyprint"&gt;&lt;pre&gt;&lt;br /&gt;object Foo {&lt;br /&gt;  def main(args: Array[String]) {&lt;br /&gt;    foo(List(1, 2, 3))&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  def foo(l: List[Int]): Int = {&lt;br /&gt;    l.foreach { (i) =&gt;&lt;br /&gt;      println(i)&lt;br /&gt;      return 5&lt;br /&gt;    }&lt;br /&gt;    return 10&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;This code will print "1" and then exit.&lt;br /&gt;&lt;br /&gt;Perhaps this is obvious, that the "return 5" applies to the "foo" method, so the values 2 and 3 in "l" will not have a chance to be printed.&lt;br /&gt;&lt;br /&gt;However, think about what is going on under the covers--Scala is passing the foreach method an anonymous inner class with a "void apply(int i)" method. And inside of that "apply" method is the code between the "{ (i) =&gt; ... 
}".&lt;br /&gt;&lt;br /&gt;So, how does code inside of the "apply" method cause the enclosing "foo" method to exit early, without the "foreach" even knowing about it?&lt;br /&gt;&lt;br /&gt;Exceptions.&lt;br /&gt;&lt;br /&gt;Here is the decompiled version of "foo":&lt;br /&gt;&lt;br /&gt;&lt;code class="prettyprint"&gt;&lt;pre&gt;&lt;br /&gt; public int foo(List&amp;lt;Integer&amp;gt; l) {&lt;br /&gt;    Object localObject = new Object();&lt;br /&gt;    int exceptionResult1 = 0;&lt;br /&gt;    try {&lt;br /&gt;      l.foreach(new AbstractFunction1() {&lt;br /&gt;        public static final long serialVersionUID = 0L;&lt;br /&gt;&lt;br /&gt;        public final Nothing$ apply(int i) {&lt;br /&gt;          Predef$.MODULE$.println(BoxesRunTime.boxToInteger(i));&lt;br /&gt;          // here is the "return"--it puts "5" into an exception&lt;br /&gt;          throw new NonLocalReturnException(&lt;br /&gt;            this.nonLocalReturnKey1$1,&lt;br /&gt;            BoxesRunTime.boxToInteger(5));&lt;br /&gt;        }&lt;br /&gt;      });&lt;br /&gt;      return 10;&lt;br /&gt;    } catch (NonLocalReturnException localNonLocalReturnException)  {&lt;br /&gt;      if (localNonLocalReturnException.key() == localObject) {&lt;br /&gt;        // get "5" back out of the exception&lt;br /&gt;        return BoxesRunTime.unboxToInt(localNonLocalReturnException.value());&lt;br /&gt;      }&lt;br /&gt;      throw localNonLocalReturnException;&lt;br /&gt;    }&lt;br /&gt;  }&lt;br /&gt;&lt;/pre&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;I think the decompiler got a little confused with "nonLocalReturnKey", but you can see the basic idea is that any early return inside of a closure is converted into an exception that is then caught outside of the closure where a proper return call can be done.&lt;br /&gt;&lt;br /&gt;I personally think this is handy, once you know what is going on. 
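&lt;br /&gt;&lt;br /&gt;To make the mechanism concrete, here is a rough, hand-written Java sketch of the same idea. It is not what scalac emits: the names EarlyReturn, IntAction, and forEach are made up for illustration, and an anonymous inner class stands in for the Scala closure. It only tries to show how a per-call key object lets a method tell its own early return apart from one that belongs to some outer call:&lt;br /&gt;&lt;br /&gt;

```java
public class NonLocalReturnSketch {
    // Hypothetical stand-in for the control-flow exception seen in the decompiled code.
    static final class EarlyReturn extends RuntimeException {
        final Object key;
        final int value;

        EarlyReturn(Object key, int value) {
            // No message, cause, suppression, or stack trace: this is control flow, not an error.
            super(null, null, false, false);
            this.key = key;
            this.value = value;
        }
    }

    // Hypothetical callback type playing the role of the closure body.
    interface IntAction {
        void apply(int i);
    }

    // Knows nothing about early returns; it simply calls the action for each element.
    static void forEach(int[] values, IntAction action) {
        for (int v : values) {
            action.apply(v);
        }
    }

    public static int foo(int[] values) {
        final Object key = new Object(); // unique to this invocation of foo
        try {
            forEach(values, new IntAction() {
                public void apply(int i) {
                    System.out.println(i);
                    throw new EarlyReturn(key, 5); // plays the role of "return 5"
                }
            });
            return 10;
        } catch (EarlyReturn e) {
            if (e.key == key) {
                return e.value; // our own early return: unwrap the value
            }
            throw e; // belongs to some outer call, so keep unwinding
        }
    }

    public static void main(String[] args) {
        System.out.println(foo(new int[] {1, 2, 3}));
    }
}
```

&lt;br /&gt;Running main prints 1 and then 5, mirroring the Scala example; with an empty array the callback never runs and foo falls through to return 10. One consequence worth keeping in mind with this sketch: a broad catch of RuntimeException placed anywhere between the callback and foo would intercept the exception and swallow the early return.&lt;br /&gt;&lt;br /&gt;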
But from what I've picked up, any closures that make it into Java 7 will not support non-local returns and will instead disallow the "return" keyword inside of closures. Though I guess, at this point, any Java closures are better than no closures at all.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-6743434894901623948?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/6743434894901623948/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=6743434894901623948' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/6743434894901623948'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/6743434894901623948'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2010/01/scala-supports-non-local-returns.html' title='Scala Supports Non-Local Returns'/><author><name>Stephen Haberman</name><uri>http://www.blogger.com/profile/05412274950722949930</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-7221753082181483487</id><published>2009-12-15T11:22:00.001-08:00</published><updated>2009-12-15T17:48:14.766-08:00</updated><title type='text'>amazon ec2 spot instances</title><content type='html'>Yesterday Amazon announced &lt;a href="http://aws.amazon.com/ec2/spot-instances/"&gt;EC2 Spot Instances&lt;/a&gt;.  The idea is that you can bid on unused EC2 instance time.  
The 'Spot price' is determined periodically by Amazon based on availability and demand for the instances.  If your bid is higher than the spot price, you will get an instance and only pay the spot price.  Of course, your instance may be terminated at any time, but the nice thing is that, unlike the normal ec2 pricing, here you only pay for full hours of usage: if Amazon terminates your instance partway through an hour, you aren't charged for that partial hour.&lt;br /&gt;&lt;br /&gt;To check out the price history of small linux instances, download the new release of the &lt;a href="http://developer.amazonwebservices.com/connect/entry.jspa?externalID=351&amp;categoryID=88"&gt;ec2-api-tools&lt;/a&gt; and run:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;ec2-describe-spot-price-history --instance-type m1.small -d Linux/UNIX -H&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Running this last night, I saw prices that looked like (times PST):&lt;br /&gt;&lt;br /&gt;&lt;img src="http://com-bizo-public.s3.amazonaws.com/blog/misc/images/spot-prices.jpg"&gt;&lt;br /&gt;&lt;br /&gt;It looks like there's a substantial discount here with prices ranging from $0.025 to $0.035 per hour (the normal ec2 price is $0.085/hr).&lt;br /&gt;&lt;br /&gt;Since I'm in the middle of reading &lt;a href="http://www.amazon.com/How-Cheat-Everything-Esoteric-Cheating/dp/1560259736"&gt;How to Cheat at Everything&lt;/a&gt;, one of my first thoughts was why not just bid say $0.10/hour?  In this way, you're unlikely to get outbid, but you'll probably stand to save significantly for a large part of the day.  Now I'm thinking this probably isn't quite a free market...  If amazon needs capacity to satisfy reserved instances, or even regular ec2 instances, maybe they'll just kill off these machines to make room.&lt;br /&gt;&lt;br /&gt;Still, this is really very cool. A great option for doing a lot of offline batch processing.  I hope we start to see support for taking advantage of this type of model in hadoop.  
It's also exciting to think that one day maybe we'll see something like this across providers -- bid for time across amazon, sun, etc.&lt;p&gt;&lt;b&gt;Update:&lt;/b&gt; Some nice charts by Tim Lossen at &lt;a href="http://www.cloudexchange.org/"&gt;cloudexchange.org&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-7221753082181483487?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/7221753082181483487/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=7221753082181483487' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7221753082181483487'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7221753082181483487'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2009/12/amazon-ec2-spot-instances.html' title='amazon ec2 spot instances'/><author><name>larry ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-5316185864994172870</id><published>2009-12-04T16:17:00.001-08:00</published><updated>2009-12-04T16:17:06.358-08:00</updated><title type='text'>github spam?</title><content type='html'>I just happened to land on the &lt;a href="http://github.com/repositories"&gt;github recent repositories page&lt;/a&gt;, and noticed a ton of spam:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://github.com/repositories" border="0"&gt;&lt;img 
src="http://com-bizo-public.s3.amazonaws.com/blog/misc/images/gspam.png"&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;A bunch of different users and projects advertising movie downloads.  There's no project content, of course, just a "homepage" that points to a target url...&lt;br /&gt;&lt;br /&gt;At first I was thinking, wow, these are some crazy spammers -- using git as a tool for spam!  But on closer look, it seems like they're just hitting the website automating account signup and new repository actions.&lt;br /&gt;&lt;br /&gt;Still, spam on github?  Crazy!  I guess no website is safe these days.  If you're hosting user generated content, you need to think about detecting and blocking spam and automation of user activity.&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-5316185864994172870?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/5316185864994172870/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=5316185864994172870' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/5316185864994172870'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/5316185864994172870'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2009/12/github-spam.html' title='github spam?'/><author><name>larry ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-821529997728278985</id><published>2009-11-30T18:28:00.001-08:00</published><updated>2009-12-01T10:37:56.150-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hive'/><category scheme='http://www.blogger.com/atom/ns#' term='aws'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>quick script: open hadoop jobtracker UI with elastic map reduce</title><content type='html'>If you've ever logged into the hadoop master with amazon's elastic map reduce, you'll see something like:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;The Hadoop UI can be accessed via the command: lynx http://localhost:9100/&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Great, but lynx?.. not as nice as firefox or safari...&lt;br /&gt;&lt;br /&gt;It's easy enough to do some ssh port forwarding so you can use your browser of choice and access the hadoop UI from your machine.&lt;br /&gt;&lt;br /&gt;But, after getting tired of typing in the ssh options a bunch of times, I finally put together a short script that automates it a bit.  The script takes in the public hostname of your hadoop master (you can get this from elastic-mapreduce --list), then picks a random port number, sets up the ssh forwarding, and opens the page in a new browser window.&lt;br /&gt;&lt;br /&gt;I call it hcon for 'hadoop console'.  
After configuring the script with the path to your emr key file, you run it like:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;hcon ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://com-bizo-public.s3.amazonaws.com/hive/misc/hcon.sh"&gt;Here's the full script&lt;/a&gt;, but in case you're curious the magic lines (wrapped) are:&lt;br /&gt;&lt;pre class="prettyprint" style="border: none;"&gt;&lt;br /&gt;ssh -f -N -o "StrictHostKeyChecking no" \&lt;br /&gt; -L ${LPORT}:localhost:9100 \&lt;br /&gt; -i ${KEYFILE} hadoop@${HOST}&lt;br /&gt;$BROWSER http://localhost:${LPORT}&lt;/pre&gt;&lt;br /&gt;(Yes, for this, I turn off StrictHostKeyChecking).&lt;br /&gt;&lt;br /&gt;Anyway, try it out and let me know if it's helpful at all.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-821529997728278985?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/821529997728278985/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=821529997728278985' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/821529997728278985'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/821529997728278985'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2009/11/quick-script-open-hadoop-jobtracker-ui.html' title='quick script: open hadoop jobtracker UI with elastic map reduce'/><author><name>larry ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-2625051156666521664</id><published>2009-11-13T10:08:00.000-08:00</published><updated>2009-11-13T10:11:50.689-08:00</updated><title type='text'>Amazon Web Services Start-up Challenge Finalists</title><content type='html'>We are excited to be one of nine &lt;a href="http://aws.amazon.com/about-aws/whats-new/2009/11/11/amazon-cloudfront-now-supports-private-content-2/"&gt;finalists&lt;/a&gt; in the AWS Start-up Challenge!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-2625051156666521664?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/2625051156666521664/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=2625051156666521664' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/2625051156666521664'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/2625051156666521664'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2009/11/amazon-web-services-start-up-challenge.html' title='Amazon Web Services Start-up Challenge Finalists'/><author><name>Donnie</name><uri>http://www.blogger.com/profile/13599133732419522440</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-1735243029774508336</id><published>2009-11-02T09:55:00.000-08:00</published><updated>2009-11-02T10:23:46.681-08:00</updated><title type='text'>Using Hudson to manage crons</title><content type='html'>&lt;p&gt;We've been using &lt;a href="http://hudson-ci.org/"&gt;Hudson&lt;/a&gt; for several months now to manage our builds -- we probably have 80-90 different projects that it's responsible for.  It's an awesome system for continuous integration and testing.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;It's also an awesome system for scheduling and managing generic jobs.  We've only just begun to use it as a cron server, but it's clear that it has numerous advantages over the more traditional way of using the unix cron service directly.&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Notification plugins -- Hudson can be easily configured to send email and Jabber notifications when cron jobs start, succeed, or fail.  You can also track your scheduled jobs via RSS.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Stdout/Stderr logging -- Hudson saves the stdout and stderr from each run automatically.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;SCM integration -- if you need to update a job, just check the changes into SVN (or whatever SCM system you use).  Hudson will automatically pick up the changes the next time your job is run.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Nice web interface -- never underestimate the productivity gains from having a good UI.  It can be surprisingly tricky to determine exactly which crons are running on a generic Unix box.  
Not so with Hudson.&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;At Bizo, we believe that developers should be getting their hands dirty in the operational aspects of their projects -- Hudson gives us an easy interface for managing our scheduled jobs using the same tools that we're familiar with for managing our build processes.  Hudson is such a great tool for continuous integration that it's easy to overlook how good it is at the simpler task of managing generic scheduled jobs.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-1735243029774508336?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/1735243029774508336/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=1735243029774508336' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/1735243029774508336'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/1735243029774508336'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2009/11/using-hudson-to-manage-crons.html' title='Using Hudson to manage crons'/><author><name>Darren</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-8898266844628279336</id><published>2009-10-20T03:52:00.000-07:00</published><updated>2009-10-20T03:58:40.272-07:00</updated><title type='text'>Clearing Linux Filesystem Cache</title><content type='html'>I was doing some performance tuning of our mysql db and was having some trouble 
consistently reproducing query performance due to IO caching that was occurring in Linux.  In case you're wondering, you can clear this cache by executing the following command as root:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;echo 1 &gt; /proc/sys/vm/drop_caches&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-8898266844628279336?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/8898266844628279336/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=8898266844628279336' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/8898266844628279336'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/8898266844628279336'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2009/10/clearing-linux-filesystem-cache.html' title='Clearing Linux Filesystem Cache'/><author><name>Timo</name><uri>http://www.blogger.com/profile/05949421779840031276</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-2741671219768112700</id><published>2009-10-16T10:53:00.001-07:00</published><updated>2009-10-16T10:57:06.829-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='bash'/><category scheme='http://www.blogger.com/atom/ns#' term='programming'/><title type='text'>bash, errors, and pipes</title><content type='html'>Our typical pattern for writing bash scripts has been to start off each script 
with:&lt;br /&gt;&lt;br /&gt;&lt;code class="prettyprint"&gt;#!/bin/bash -e&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;The &lt;code&gt;-e&lt;/code&gt; option will cause the script to exit immediately if a command has exited with a non-zero status.  This way your script will fail as early as possible, and you never get into a case where on the surface, it looks like the script completed, but you're left with an empty file, or missing lines, etc.&lt;br /&gt;&lt;br /&gt;Of course, this is only for "simple" commands, so in practice, you can think of it terminating immediately if the entire line fails.  So a script like:&lt;pre class="prettyprint" style="border: none;"&gt;&lt;br /&gt;#!/bin/bash -e&lt;br /&gt;/usr/bin/false || true&lt;br /&gt;echo "i am still running"&lt;br /&gt;&lt;/pre&gt;will still print "i am still running," and the script will exit with a zero exit status.&lt;br /&gt;&lt;br /&gt;Of course, if you wrote it that way, that's probably what you're expecting.  And, it's easy enough to change (just drop the "|| true"; note that switching "||" to "&amp;&amp;" would not help, since &lt;code&gt;-e&lt;/code&gt; ignores a failure on the left-hand side of an "&amp;&amp;" list).&lt;br /&gt;&lt;br /&gt;The thing that was slightly surprising to me was how a script would behave using pipes.&lt;pre class="prettyprint" style="border: none;"&gt;&lt;br /&gt;#!/bin/bash -e&lt;br /&gt;/usr/bin/false | sort &gt; sorted.txt&lt;br /&gt;echo "i am still running"&lt;br /&gt;&lt;/pre&gt;If your script is piping its output to another command, it turns out that the return status of a pipeline is the exit status of its last command.  So, the script above will also print "i am still running" and exit with a 0 exit status.&lt;br /&gt;&lt;br /&gt;Bash provides a &lt;code&gt;PIPESTATUS&lt;/code&gt; variable, which is an array containing a list of the exit status values from the pipeline.  So, if we checked &lt;code&gt;${PIPESTATUS[0]}&lt;/code&gt; it would contain 1 (the exit value of &lt;code&gt;/usr/bin/false&lt;/code&gt;), and &lt;code&gt;${PIPESTATUS[1]}&lt;/code&gt; would contain 0 (exit value of sort).  
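&lt;br /&gt;&lt;br /&gt;As a quick sketch of this (assuming bash, and assuming &lt;code&gt;false&lt;/code&gt; and &lt;code&gt;sort&lt;/code&gt; are found on your PATH rather than at the absolute paths used above), you can copy &lt;code&gt;PIPESTATUS&lt;/code&gt; into an array of your own right after the pipeline.  The &lt;code&gt;status&lt;/code&gt; name is made up for illustration:

```shell
#!/bin/bash
# Run a pipeline whose first command fails and whose last command succeeds.
false | sort > /dev/null

# Copy PIPESTATUS right away; the very next command would overwrite it.
# ("status" is a hypothetical variable name, not anything bash defines.)
status=("${PIPESTATUS[@]}")

# prints: false exited with 1, sort exited with 0
echo "false exited with ${status[0]}, sort exited with ${status[1]}"
```

Working from the copy means later commands can't clobber the values you care about.&lt;br /&gt;&lt;br /&gt;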
Of course, &lt;code&gt;PIPESTATUS&lt;/code&gt; is volatile, so you must check it immediately.  Any other command you run will affect its value.&lt;br /&gt;&lt;br /&gt;This is great, but not exactly what I wanted.  Luckily, there's another bash option &lt;code&gt;-o pipefail&lt;/code&gt;, which will change the way the pipeline exit code is derived.  Instead of being the exit status of the last command, it becomes the exit status of the last (rightmost) command to exit with a non-zero status, or zero if every command in the pipeline succeeds.  So&lt;pre class="prettyprint" style="border: none;"&gt;&lt;br /&gt;#!/bin/bash -e&lt;br /&gt;# pipefail is set here rather than on the #! line, because Linux passes&lt;br /&gt;# everything after the interpreter path to bash as a single argument&lt;br /&gt;set -o pipefail&lt;br /&gt;/usr/bin/false | sort &gt; sorted.txt&lt;br /&gt;echo "this line will never execute"&lt;br /&gt;&lt;/pre&gt;So, thanks to pipefail, the above script will work as we expect.  Since &lt;code&gt;/usr/bin/false&lt;/code&gt; returns a non-zero exit status, the entire pipeline will return a non-zero exit status, the script will die immediately because of &lt;code&gt;-e&lt;/code&gt;, and the echo will never execute.&lt;br /&gt;&lt;br /&gt;Of course, all of this information is contained in the bash man page, but I had never really run into it / looked into it before, and I thought it was interesting enough to write up.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-2741671219768112700?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/2741671219768112700/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=2741671219768112700' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/2741671219768112700'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/2741671219768112700'/><link rel='alternate' type='text/html' 
href='http://dev.bizo.com/2009/10/bash-errors-and-pipes.html' title='bash, errors, and pipes'/><author><name>larry ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-3844778815295836769</id><published>2009-10-12T20:07:00.000-07:00</published><updated>2009-10-13T12:05:45.431-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ruby'/><category scheme='http://www.blogger.com/atom/ns#' term='s3'/><title type='text'>s3fsr 1.4 released</title><content type='html'>&lt;a href="http://github.com/stephenh/s3fsr"&gt;s3fsr&lt;/a&gt; is a tool we built at Bizo to help quickly get files into/out of S3. It's had a few 1.x releases, but by 1.4 we figured it was worth getting around to posting about.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;Overview&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;While there are a lot of great S3 tools out there, s3fsr's niche is that it's a &lt;a href="http://fuse.sourceforge.net/"&gt;FUSE&lt;/a&gt;/Ruby user land file system.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For a command line user, this is handy, because it means you can do:&lt;/div&gt;&lt;blockquote class="prettyprint"&gt;&lt;code&gt;# mount yourbucket in ~/s3&lt;br /&gt;s3fsr yourbucketname ~/s3&lt;br /&gt;&lt;br /&gt;# see the directories/files&lt;br /&gt;ls ~/s3/&lt;br /&gt;&lt;br /&gt;# upload&lt;br /&gt;mv ~/local.txt ~/s3/remotecopy.txt&lt;br /&gt;&lt;br /&gt;# download&lt;br /&gt;cp ~/s3/remote.txt ~/localcopy.txt&lt;/code&gt;&lt;/blockquote&gt;&lt;div&gt;Behind the scenes, s3fsr is talking to the Amazon S3 REST API and getting/putting directory and 
file content. It will cache directory listings (not file content), so ls/tab completion will be quick after the initial short delay.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;b&gt;S3 And Directory Conventions&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A unique aspect of s3fsr, and the answer to a specific annoyance it was written to address, is that it understands several different directory conventions used by various S3 tools.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This directory convention problem stems from Amazon's decision to forgo any explicit notion of directories in the API, and instead force everyone to realize that S3 is not a file system but a giant hash table of string key -&gt; huge byte array.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Let's take an example--you want to store two files, "/dir1/foo.txt" and "/dir1/bar.txt" in S3. In a traditional file system, you'd have 3 file system entries: "/dir1", "/dir1/foo.txt", and "/dir1/bar.txt". Note that "/dir1" gets its own entry.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In S3, without tool-specific conventions, storing "/dir1/foo.txt" and "/dir1/bar.txt" really means only 2 entries. "/dir1" does not exist of its own accord. The S3 API, when reading and writing, never parses keys apart by "/", it just treats the whole path as one big key to get/set in its hash table.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For Amazon, this "no /dir1" approach makes sense due to the scale of their system. 
If they let you have a "/dir1" entry, pretty soon API users would want the equivalent of a "rm -fr /dir1", which, for Amazon, means instead of a relatively simple "remove the key from the hash table" operation, they have to start walking a hierarchical structure and deleting child files/directories as they go.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;When the keys are strewn across a distributed hash table like Dynamo, this increases the complexity and makes the runtime nondeterministic.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Which Amazon, being a bit OCD about their SLAs and 99th percentiles, doesn't care for.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So, no S3 native directories.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;There is one caveat--the S3 API lets you progressively infer the existence of directories by probing the hash table keys with prefixes and delimiters.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In our example, if you probe with "prefix=/" and "delimiter=/", S3 will then, and only then, split &amp;amp; group the "/dir1/foo.txt" and "/dir1/bar.txt" keys on "/" and return you just "dir1/" as what the S3 API calls a "common prefix".&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Which is kind of like a directory. Except that you have to create the children first, and then the directory pops into existence. Delete the children, and the directory pops out of existence.&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;div&gt;This brings us to the authors of tools like s3sync and S3 Organizer--their users want the familiar "make a new directory, double click it, make a new file in it" idiom, not a backwards "make the children files first" idiom. 
It is, understandably, different from what users expect.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So, the tool authors got creative and basically added their own "/dir1" marker entries to S3 when users perform a "new directory" operation to get back to the "directory first" idiom.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Note this is a hack, because issuing a "REMOVE /dir1" to S3 will not recursively delete the child files (to S3, "/dir1" is just a meaningless key with no relation to any other key in the hash table). So now the burden is on the tool to do its own recursive iteration/deletion of the directories.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Which is cool, and actually works pretty well, except that the two primary tools implemented marker entries differently:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://s3sync.net/wiki"&gt;s3sync&lt;/a&gt; created marker entries (e.g. a "/dir1" entry) with a hard-coded content that etags (hashes) to a specific value. This known hash is nice because it makes it easy to distinguish directory entries from file entries when listing S3 entries; since S3 knows nothing about directories, the tool has to infer on its own which keys represent files and which represent directories.&lt;/li&gt;&lt;li&gt;&lt;a href="https://addons.mozilla.org/en-US/firefox/addon/3247"&gt;S3 Organizer&lt;/a&gt; created marker entries as well, but instead of a known etag/hash, they suffixed the directory name, so the key of "/dir1" is actually "/dir1_$folder$". 
It's then the job of the tool to recognize the suffix as a marker directory entry, strip off the suffix before showing the name to the user, and use a directory icon instead of a file icon.&lt;/li&gt;&lt;/ul&gt;So, if you use an S3 tool that does not understand these 3rd party conventions, browsing a well-used bucket will likely end up looking odd with obscure/duplicate entries:&lt;/div&gt;&lt;blockquote class="prettyprint"&gt;&lt;code&gt;/dir1            # s3sync marker entry file&lt;br /&gt;/dir1            # common prefix directory&lt;br /&gt;/dir1/foo.txt    # actual file entry&lt;br /&gt;/dir2_$folder$   # s3 organizer marker entry file&lt;br /&gt;/dir2            # common prefix directory&lt;br /&gt;/dir2/foo.txt    # actual file entry&lt;/code&gt;&lt;/blockquote&gt;&lt;div&gt;This quickly becomes annoying.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;And so s3fsr understands all three conventions, s3sync, S3 Organizer, and common prefixes, and just generally tries to do the right thing.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;b&gt;FUSE Rocks&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;One final note is that the &lt;a href="http://fuse.sourceforge.net/"&gt;FUSE&lt;/a&gt; project is awesome. 
Implementing mountable file systems that users can "ls" around in usually involves messy, error-prone kernel integration that is hard to write and, if the file system code misbehaves, can screw up your machine.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;FUSE takes a different approach and does the messy kernel code just once, in the FUSE project itself, and then it acts as a proxy out to your user-land, process-isolated, won't-blow-up-the-box process to handle the file system calls.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This proxy/user land indirection does degrade performance, so you wouldn't use it for your main file system, but for scenarios like s3fsr, it works quite well.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;And FUSE language bindings like &lt;a href="http://rubyforge.org/projects/fusefs"&gt;fusefs&lt;/a&gt; for Ruby make it a cinch to develop too--s3fsr is all of 280 LOC.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;Wrapping up&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Let us know if you find &lt;a href="http://github.com/stephenh/s3fsr"&gt;s3fsr&lt;/a&gt; useful--hop over to the github site, install the gem, kick the tires, and submit any feedback you might have.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-3844778815295836769?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/3844778815295836769/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=3844778815295836769' title='0 Comments'/><link 
rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/3844778815295836769'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/3844778815295836769'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2009/10/s3fsr-14-released.html' title='s3fsr 1.4 released'/><author><name>Stephen Haberman</name><uri>http://www.blogger.com/profile/05412274950722949930</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-4592090672954062882</id><published>2009-10-12T16:13:00.000-07:00</published><updated>2009-10-12T16:28:06.874-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='job'/><title type='text'>Want to be challenged at work?</title><content type='html'>We've got a few challenges and are looking to grow our (kick ass) engineering team.  
Check out the opportunities below and &lt;a href="mailto:donnie+eng-job-post@bizo.com"&gt;reach out&lt;/a&gt; if you think you've got what it takes...&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://bit.ly/YebR"&gt;Operations Engineer&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://bit.ly/1X2I9C"&gt;Sales Engineer&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-4592090672954062882?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/4592090672954062882'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/4592090672954062882'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2009/10/want-to-be-challenged-at-work.html' title='Want to be challenged at work?'/><author><name>Donnie</name><uri>http://www.blogger.com/profile/13599133732419522440</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-4271488597683328958</id><published>2009-10-08T10:24:00.000-07:00</published><updated>2009-10-08T10:57:23.561-07:00</updated><title type='text'>Efficiently selecting random sub-collections.</title><content type='html'>Here's a handy algorithm for randomly choosing k elements from a collection of n elements (assume k &amp;lt; n)&lt;br /&gt;&lt;br /&gt;&lt;code class="prettyprint"&gt;&lt;br /&gt;public static &amp;lt;T&amp;gt; List&amp;lt;T&amp;gt; pickRandomSubset(Collection&amp;lt;T&amp;gt; source, int k, Random r) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;List&amp;lt;T&amp;gt; toReturn = new ArrayList&amp;lt;T&amp;gt;(k);&lt;br /&gt;&amp;nbsp;&amp;nbsp;double 
remaining = source.size();&lt;br /&gt;&amp;nbsp;&amp;nbsp;for (T item : source) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;double nextChance = (k - toReturn.size()) / remaining;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;if (r.nextDouble() &amp;lt; nextChance) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;toReturn.add(item);&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;if (toReturn.size() == k) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;break;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;}&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;}&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;--remaining;&lt;br /&gt;&amp;nbsp;&amp;nbsp;}&lt;br /&gt;&amp;nbsp;&amp;nbsp;return toReturn;&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;The basic idea is to iterate through the source collection only once.  For each element, we can compute the probability that it should be selected, which simply equals the number of items left to pick divided by the total number of items left.&lt;br /&gt;&lt;br /&gt;Another nice thing about this algorithm is that it also works efficiently if the source is too large to fit in memory, provided you know (or can count) how many elements are in the source.&lt;br /&gt;&lt;br /&gt;This isn't exactly anything groundbreaking, but it's far better than my first inclination to use library functions to randomly sort my list before taking a leading sublist.&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-4271488597683328958?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/4271488597683328958/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=4271488597683328958' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/4271488597683328958'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/4271488597683328958'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2009/10/efficiently-selecting-random-sub.html' title='Efficiently selecting random sub-collections.'/><author><name>Darren</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-3736872263908743511</id><published>2009-10-07T11:07:00.001-07:00</published><updated>2009-10-08T11:07:54.001-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='hive'/><title type='text'>hive map reduce in java</title><content type='html'>In my last post, I went through an example of writing &lt;a href="http://dev.bizo.com/2009/10/reduce-scripts-in-hive.html"&gt;custom reduce scripts in hive&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Writing a streaming reducer requires a lot of the same work to check for when keys change.  
Additionally, in java, there's a decent amount of boilerplate to go through just to read the columns from stdin.&lt;br /&gt;&lt;br /&gt;To help with this, I put together a really simple little framework that more closely resembles the hadoop Mapper and Reducer interfaces.&lt;br /&gt;&lt;br /&gt;To use it, you just need to write a really simple reduce method:&lt;br /&gt;&lt;pre class="prettyprint" style="border: none;"&gt;  void reduce(String key, Iterator&amp;lt;String[]&amp;gt; records, Output output);&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The helper code will handle all IO, as well as the grouping together of records that have the same key.  The 'records' Iterator will run you through all rows that have the key specified in key.  It is assumed that the first column is the key.  Each element in the String[] record represents a column.  These rows aren't buffered in memory or anything, so it can handle an arbitrary number of rows.&lt;br /&gt;&lt;br /&gt;Here's the complete example from my reduce post, in java (even shorter than perl).&lt;br /&gt;&lt;pre class="prettyprint" style="border: none;"&gt;public class Condenser {&lt;br /&gt;  public static void main(final String[] args) {&lt;br /&gt;    new GenericMR().reduce(System.in, System.out, new Reducer() {&lt;br /&gt;      public void reduce(String key, Iterator&amp;lt;String[]&amp;gt; records, Output output) throws Exception {&lt;br /&gt;        final StringBuilder vals = new StringBuilder();&lt;br /&gt;        while (records.hasNext()) {&lt;br /&gt;          // note we use col[1] -- the key is provided again as col[0]&lt;br /&gt;          vals.append(records.next()[1]);&lt;br /&gt;          if (records.hasNext()) { vals.append(","); }&lt;br /&gt;        }&lt;br /&gt;        output.collect(new String[] { key, vals.toString() });&lt;br /&gt;      }&lt;br /&gt;    });&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Here's a wordcount reduce example:&lt;br /&gt;&lt;br /&gt;&lt;pre 
class="prettyprint" style="border: none;"&gt;public class WordCountReduce {&lt;br /&gt;  public static void main(final String[] args) {&lt;br /&gt;    new GenericMR().reduce(System.in, System.out, new Reducer() {&lt;br /&gt;      public void reduce(String key, Iterator&amp;lt;String[]&amp;gt; records, Output output) throws Exception {&lt;br /&gt;        int count = 0;&lt;br /&gt;        &lt;br /&gt;        while (records.hasNext()) {&lt;br /&gt;          count += Integer.parseInt(records.next()[1]);&lt;br /&gt;        }&lt;br /&gt;        &lt;br /&gt;        output.collect(new String[] { key, String.valueOf(count) });&lt;br /&gt;      }&lt;br /&gt;    });&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Although the real value is in making it easy to write reducers, there's also support for helping with mappers.  Here's my key-value split mapper from a previous example:&lt;br /&gt;&lt;br /&gt;&lt;pre class="prettyprint" style="border: none;"&gt;  public class KeyValueSplit {&lt;br /&gt;    public static void main(final String[] args) {&lt;br /&gt;      new GenericMR().map(System.in, System.out, new Mapper() {&lt;br /&gt;        public void map(String[] record, Output output) throws Exception {&lt;br /&gt;          for (final String kvs : record[0].split(",")) {&lt;br /&gt;            final String[] kv = kvs.split("=");&lt;br /&gt;            output.collect(new String[] { kv[0], kv[1] });&lt;br /&gt;          }&lt;br /&gt;        }&lt;br /&gt;      });&lt;br /&gt;    }&lt;br /&gt;  }&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The full source code is &lt;a href="http://github.com/ogrodnek/shmrj"&gt;available here&lt;/a&gt;.  
Or you can download a &lt;a href="http://github.com/ogrodnek/shmrj/downloads"&gt;prebuilt jar here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The only dependency is &lt;a href="http://commons.apache.org/lang/"&gt;apache commons-lang&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I'd love to hear any feedback you may have.&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-3736872263908743511?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/3736872263908743511/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=3736872263908743511' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/3736872263908743511'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/3736872263908743511'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2009/10/hive-map-reduce-in-java.html' title='hive map reduce in java'/><author><name>larry ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-643670035108860185</id><published>2009-10-06T13:37:00.000-07:00</published><updated>2009-10-06T15:27:56.952-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='firefox plugin'/><category scheme='http://www.blogger.com/atom/ns#' term='sdbtool'/><category scheme='http://www.blogger.com/atom/ns#' term='simple db'/><title type='text'>Simple DB Firefox Plugin -- New 
Release</title><content type='html'>I finally got around to updating our &lt;a href="http://github.com/floodfx/sdbtool"&gt;open-sourced&lt;/a&gt; Simple DB Firefox Plugin creatively named SDB Tool.&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;a href="http://com-bizo-public.s3.amazonaws.com/tools/sdbizo/screen_shots/sdbtool_ss_sm.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 450px; height: 120px; border: 1px solid black" src="http://com-bizo-public.s3.amazonaws.com/tools/sdbizo/screen_shots/sdbtool_ss_sm.jpg" border="1" /&gt;&lt;/a&gt;&lt;span style="font-size:78%;"&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;obligatory screen shot&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;The major highlights include:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Runs in Firefox 3.5!&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Support for "&lt;a href="http://docs.amazonwebservices.com/AmazonSimpleDB/2009-04-15/DeveloperGuide/index.html?UsingSelect.html"&gt;Select&lt;/a&gt;" Queries (e.g. 
Version &lt;a href="http://docs.amazonwebservices.com/AmazonSimpleDB/2009-04-15/DeveloperGuide/index.html"&gt;2009-04-15&lt;/a&gt; of the API)&lt;/li&gt;&lt;li&gt;Lots of UI Tweaks and Refactoring...&lt;/li&gt;&lt;/ul&gt;Please report any issues &lt;a href="http://code.google.com/p/sdbtool/issues/list"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Click &lt;a href="http://s3.amazonaws.com/com-bizo-public/tools/sdbizo/sdbizo.xpi"&gt;here&lt;/a&gt; to install.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-643670035108860185?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/643670035108860185/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=643670035108860185' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/643670035108860185'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/643670035108860185'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2009/10/simple-db-firefox-plugin-new-release.html' title='Simple DB Firefox Plugin -- New Release'/><author><name>Donnie</name><uri>http://www.blogger.com/profile/13599133732419522440</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-7439422258256572536</id><published>2009-10-06T09:31:00.000-07:00</published><updated>2009-10-08T11:00:39.044-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hive'/><title type='text'>reduce scripts in 
hive</title><content type='html'>In a previous post, I discussed &lt;a href="http://dev.bizo.com/2009/07/custom-map-scripts-and-hive.html"&gt;writing custom map scripts in hive&lt;/a&gt;.  Now, let's talk about reduce tasks.&lt;br /&gt;&lt;h2&gt;The basics&lt;/h2&gt;As before, you are not writing an org.apache.hadoop.mapred.Reducer class.  Your reducer is just a simple script that reads from stdin (columns separated by \t) and should write rows to stdout (again, columns separated by \t).&lt;br /&gt;&lt;br /&gt;Another thing to mention is that you can't run a reduce without first doing a map.&lt;br /&gt;&lt;br /&gt;The rows to your reduce script will be sorted by key (you specify which column this is), so that all rows with the same key will be consecutive.  One thing that's kind of a pain with hive reducers, is that you need to keep track of when keys change yourself.  Unlike a hadoop reducer where you get a (K key, Iterator&amp;lt;V&amp;gt; values), here you just get row after row of columns.&lt;br /&gt;&lt;h2&gt;An example&lt;/h2&gt;We'll use a similar example to the &lt;a href="http://dev.bizo.com/2009/07/custom-map-scripts-and-hive.html"&gt;map script&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;We will attempt to condense a table (kv_input) that looks like:&lt;br /&gt;&lt;pre&gt;k1 v1&lt;br /&gt;k2 v1&lt;br /&gt;k4 v1&lt;br /&gt;k2 v3&lt;br /&gt;k3 v1&lt;br /&gt;k1 v2&lt;br /&gt;k4 v2&lt;br /&gt;k2 v2&lt;br /&gt;...&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;into one (kv_condensed) that looks like:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;k1 v1,v2&lt;br /&gt;k2 v1,v2,v3&lt;br /&gt;...&lt;br /&gt;&lt;/pre&gt;&lt;h2&gt;The reduce script&lt;br /&gt;&lt;/h2&gt;&lt;pre class="prettyprint" style="border: none;"&gt;#!/usr/bin/perl                                                                                       &lt;br /&gt;&lt;br /&gt;undef $currentKey;&lt;br /&gt;@vals=();&lt;br /&gt;&lt;br /&gt;while (&amp;lt;STDIN&amp;gt;) {&lt;br /&gt;  chomp();&lt;br /&gt;  
processRow(split(/\t/));&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;output();&lt;br /&gt;&lt;br /&gt;sub output() {&lt;br /&gt;  print $currentKey . "\t" . join(",", sort @vals) . "\n";&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;sub processRow() {&lt;br /&gt;  my ($k, $v) = @_;&lt;br /&gt;&lt;br /&gt;  if (! defined($currentKey)) {&lt;br /&gt;    $currentKey = $k;&lt;br /&gt;    push(@vals, $v);&lt;br /&gt;    return;&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  if ($currentKey ne $k) {&lt;br /&gt;    output();&lt;br /&gt;    $currentKey = $k;&lt;br /&gt;    @vals=($v);&lt;br /&gt;    return;&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  push(@vals, $v);&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Please forgive my perl.  It's been a long time (I usually write these in java, but thought perl would make for an easier blog example).&lt;br /&gt;&lt;br /&gt;As you can see, a lot of the work goes in to just keeping track of when the keys change.&lt;br /&gt;&lt;br /&gt;The nice thing about these simple reduce scripts is that it's very easy to test locally, without going through hadoop and hive.  Just call your script and pass in some example text separated by tabs.  
If you do this, you need to remember to sort the input by key before passing into your script (this is usually done by hadoop/hive).&lt;br /&gt;&lt;style type="text/css"&gt;&lt;br /&gt;.line_numbers {&lt;br /&gt;  color: gray;&lt;br /&gt;}&lt;br /&gt;.script {&lt;br /&gt;  color: black;&lt;br /&gt;}&lt;br /&gt;&lt;/style&gt;&lt;br /&gt;&lt;h2&gt;Reducing from Hive&lt;/h2&gt;Okay, now that we have our reduce script working, let's run it from Hive.&lt;br /&gt;&lt;br /&gt;First, we need to add our map and reduce scripts:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;add file &lt;a href="http://com-bizo-public.s3.amazonaws.com/hive/reduce/identity.pl"&gt;identity.pl&lt;/a&gt;;&lt;br /&gt;add file &lt;a href="http://com-bizo-public.s3.amazonaws.com/hive/reduce/condense.pl"&gt;condense.pl&lt;/a&gt;;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Now for the real work:&lt;br /&gt;&lt;table height="213" style="width: 305px;"&gt;&lt;br /&gt;&lt;tbody&gt;&lt;tr&gt;&lt;br /&gt;&lt;td&gt;&lt;br /&gt;&lt;pre class="line_numbers"&gt;01&lt;br /&gt;02&lt;br /&gt;03&lt;br /&gt;04&lt;br /&gt;05&lt;br /&gt;06&lt;br /&gt;07&lt;br /&gt;08&lt;br /&gt;09&lt;br /&gt;10&lt;br /&gt;11&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;/td&gt;&lt;br /&gt;&lt;td&gt;&lt;br /&gt;&lt;pre class="script"&gt;from (&lt;br /&gt;  from kv_input&lt;br /&gt;  MAP k, v&lt;br /&gt;  USING './&lt;a href="http://com-bizo-public.s3.amazonaws.com/hive/reduce/identity.pl"&gt;identity.pl&lt;/a&gt;'&lt;br /&gt;  as k, v&lt;br /&gt; cluster by k) map_output&lt;br /&gt;insert overwrite table kv_condensed&lt;br /&gt;reduce k, v&lt;br /&gt;  using './&lt;a href="http://com-bizo-public.s3.amazonaws.com/hive/reduce/condense.pl"&gt;condense.pl&lt;/a&gt;'&lt;br /&gt;  as k, v&lt;br /&gt;;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;/td&gt;&lt;br /&gt;&lt;/tr&gt;&lt;br /&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;This is fairly dense, so I will attempt to give a line by line breakdown:&lt;br /&gt;&lt;br /&gt;On line 3 we are specifying the columns to pass to 
our map script from the input table (specified on line 2).&lt;br /&gt;&lt;br /&gt;As I mentioned, you must specify a map script in order to reduce.  For this example, we're just using a simple identity perl script.  On line 5 we name the columns the map script will output.&lt;br /&gt;&lt;br /&gt;Line 6 specifies the column which is the key.  This is how the rows will be sorted when passed to your reduce script.&lt;br /&gt;&lt;br /&gt;Line 8 specifies the columns to pass into our reducer (from the map output columns on line 5).&lt;br /&gt;&lt;br /&gt;Finally, line 10 names the output columns from our reducer.&lt;br /&gt;&lt;br /&gt;(Here's my &lt;a href="http://com-bizo-public.s3.amazonaws.com/hive/reduce/red_example.txt"&gt;full hive session&lt;/a&gt; for this example, and an &lt;a href="http://com-bizo-public.s3.amazonaws.com/hive/reduce/kv_input.txt"&gt;example input file&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;I hope this was helpful.  Next time, I'll talk about some java code I put together to simplify the process of writing reduce scripts.&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-7439422258256572536?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/7439422258256572536/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=7439422258256572536' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7439422258256572536'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7439422258256572536'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2009/10/reduce-scripts-in-hive.html' title='reduce scripts in 
hive'/><author><name>larry ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-8479842031582485979</id><published>2009-10-05T11:14:00.000-07:00</published><updated>2009-10-05T11:53:34.847-07:00</updated><title type='text'>Developing on the Scala console with JavaRebel</title><content type='html'>If you're the type of developer who likes to mess around interactively with your code, you should definitely be using the Scala console.  Even if you're not actually using any Scala in your code, you can still instantiate your Java classes, call their methods, and play around with the results.  Here's a handy script that I stick in the top-level of my Eclipse projects that will start an interactive console with my compiled code on the classpath:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;#!/bin/bash&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;tempfile=`mktemp /tmp/tfile.XXXXXXXXXX`&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;/usr/bin/java -jar /mnt/bizo/ivy-script/ivy.jar -settings /mnt/bizo/ivy-script/ivyconf.xml -cachepath ${tempfile} &gt; /dev/null&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;classpath=`cat ${tempfile} | tr -d "\n\r"`&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;rm ${tempfile}&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;exec /usr/bin/java -classpath /opt/local/share/scala/lib:target/classes:${classpath} -noverify -javaagent:/opt/javarebel/javarebel.jar  scala.tools.nsc.MainGenericRunner&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;(Since we already use Ivy for 
dependency management, this script also pulls in the appropriate jar files from the Ivy cache.  See &lt;a href="http://dev.bizo.com/2009/07/dependency-management-for-scala-scripts.html"&gt;this post&lt;/a&gt; for more details.)&lt;br /&gt;&lt;br /&gt;The javaagent I'm using here is &lt;a href="http://www.zeroturnaround.com/jrebel/"&gt;JavaRebel&lt;/a&gt;, a really awesome tool that provides automatic code reloading at runtime.  Using the Scala console and JavaRebel, I can instantiate an object on the console and test a method.  If I get an unexpected result, I can switch back to Eclipse, fix a bug or add some additional logging, and rerun the exact same method back on the console.  JavaRebel will automagically detect that the class file was changed and reload it into the console, and the changes will even be reflected in the objects I created beforehand.&lt;br /&gt;&lt;br /&gt;The icing on this cake is that Zero Turnaround (the makers of JavaRebel) is giving away &lt;a href="http://www.zeroturnaround.com/scala-license/"&gt;free licenses&lt;/a&gt; to Scala developers.  
How awesome is that?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-8479842031582485979?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/8479842031582485979/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=8479842031582485979' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/8479842031582485979'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/8479842031582485979'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2009/10/developing-on-scala-console-with.html' title='Developing on the Scala console with JavaRebel'/><author><name>Darren</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-8690191868410062070</id><published>2009-09-10T11:02:00.000-07:00</published><updated>2009-09-10T13:00:45.108-07:00</updated><title type='text'>Running ScalaTest BDD Tests from Eclipse</title><content type='html'>At Bizo, we're using Scala for a few things here and there. While investigating testing approaches for Scala, I came across &lt;a href="http://www.artima.com/scalatest/"&gt;ScalaTest&lt;/a&gt; and its Behavior Driven Development (BDD) spec approach.&lt;br /&gt;&lt;br /&gt;While it's a small thing, I really like the sentence-based it "should do this and that" aspect of the spec approach. 
You get great readability compared to traditional "testDoThisAndThat" method names.&lt;br /&gt;&lt;br /&gt;However, a large downside to the spec approach is that spec tests cannot, on their own, be run from within Eclipse with a single keyboard shortcut. The built-in Eclipse JUnit test runner does not understand the describe/it-based test structure.&lt;br /&gt;&lt;br /&gt;To solve this, I &lt;a href="http://draconianoverlord.com/2009/09/10/scalatest-spec-from-eclipse.html"&gt;wrote a class&lt;/a&gt; that can be used with JUnit's "RunWith" annotation to bridge the gap between JUnit and ScalaTest. It's not perfect, but you get back the one-shortcut/greenbar runner in Eclipse. So I can definitely see it being handy if we decide to do any spec-based testing here at Bizo.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-8690191868410062070?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/8690191868410062070/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=8690191868410062070' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/8690191868410062070'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/8690191868410062070'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2009/09/running-scalatest-bdd-tests-from.html' title='Running ScalaTest BDD Tests from Eclipse'/><author><name>Stephen Haberman</name><uri>http://www.blogger.com/profile/05412274950722949930</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-6074254034905294829</id><published>2009-09-09T15:41:00.001-07:00</published><updated>2009-09-09T15:41:45.363-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='GWT'/><category scheme='http://www.blogger.com/atom/ns#' term='macosx'/><title type='text'>GWT hosted mode on snow leopard</title><content type='html'>One of the first things I noticed after installing Snow Leopard was that GWT hosted mode no longer worked.  You'll see the message "You must use a Java 1.5 runtime to use GWT Hosted Mode on Mac OS X."  After spending about 10 minutes convincing myself that I was in fact using jdk1.5 for eclipse, ant, etc., and like, wasn't this working last week? I finally looked at the jdk symlinks in JavaVM.framework and figured out that 1.5 was just pointing to 1.6...  interesting.&lt;br /&gt;&lt;br /&gt;There was some &lt;a href="http://groups.google.com/group/google-web-toolkit/browse_thread/thread/e9fcc378d8b48733/31b40b14eb9cd5c0?show_docid=31b40b14eb9cd5c0"&gt;discussion on the GWT group&lt;/a&gt;, along with a proposed fix of downloading someone's packaged leopard JDK and changing the symlinks.  
Not a great fix...&lt;br /&gt;&lt;br /&gt;The Lombardi development team has come up with a &lt;a href="http://development.lombardi.com/?p=1012"&gt;great work-around&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I put together a &lt;a href="http://com-bizo-public.s3.amazonaws.com/code/gwt/gwt-dev-mac-snow-1.6.4.jar"&gt;jar with the modified BootStrapPlatform code&lt;/a&gt; (contains both .class and .java), or get &lt;a href="http://com-bizo-public.s3.amazonaws.com/code/gwt/BootStrapPlatform.java"&gt;just the src here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Here are some step-by-step instructions for getting this working in Eclipse:&lt;br /&gt;&lt;ol&gt;&lt;br /&gt;&lt;li&gt;Add the gwt-dev-mac-snow jar to your Java Build Path.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;In Java Build Path -&gt; Order and Export, move the gwt-dev-mac-snow jar above the GWT SDK Library.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Go to Run-&gt;Run Configurations.  In Web Applications-&gt;(your GWT project), click on Arguments, then add -d32 under VM arguments.&lt;/li&gt;&lt;br /&gt;&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;That's it!  
You should now be able to run GWT hosted mode on Snow Leopard.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-6074254034905294829?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/6074254034905294829/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=6074254034905294829' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/6074254034905294829'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/6074254034905294829'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2009/09/gwt-hosted-mode-on-snow-leopard.html' title='GWT hosted mode on snow leopard'/><author><name>larry ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-1121040376745585942</id><published>2009-08-11T10:52:00.000-07:00</published><updated>2009-08-11T11:04:15.544-07:00</updated><title type='text'>Setting up AWS keys for Eclipse</title><content type='html'>One somewhat annoying thing about running JUnit tests in Eclipse is that they do not inherit your system's environment variables.  There are good reasons for this, but we pass our AWS credentials to all of our applications via system variable, and it's a pain to add these to every single run configuration that needs them.  
This gets especially tedious when a significant number of your JUnit tests require AWS access.&lt;br /&gt;&lt;br /&gt;As a workaround, you can add "Default VM Arguments" to the JVM you use to run your tests.  Simply go to "Preferences-&gt;Java-&gt;Installed JREs" and edit your default JVM.  Right under the JRE name is a space to add default VM arguments.  I simply added "&lt;span style="font-family: courier new;"&gt;-DAWS_SECRET_ACCESS_KEY=foo -DAWS_ACCESS_KEY_ID=bar&lt;/span&gt;", and now I no longer need to manually edit individual run configurations.&lt;br /&gt;&lt;br /&gt;This method seems a bit hacky to me, but until I can get a global run configuration, it definitely beats manually setting common environment variables for individual tests.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-1121040376745585942?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/1121040376745585942/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=1121040376745585942' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/1121040376745585942'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/1121040376745585942'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2009/08/setting-up-aws-keys-for-eclipse.html' title='Setting up AWS keys for Eclipse'/><author><name>Darren</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-6864489159165677500</id><published>2009-07-22T14:15:00.000-07:00</published><updated>2009-07-22T14:44:00.567-07:00</updated><title type='text'>Dependency management for Scala scripts using Ivy</title><content type='html'>I'm quickly becoming a huge fan of Scala scripting.  Because Scala is Java-compatible, we can easily use our existing Java code base in scripts.  This is especially convenient as we're moving our reporting to Hive, which supports script-based Hadoop streaming for custom Mappers and Reducers.&lt;br /&gt;&lt;br /&gt;The one very annoying thing about Scala scripting is managing dependencies.  My initial method was to have my bash preamble manually download the required libraries to the current directory and add them to the Scala classpath.  So, my scripts looked something like this:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;#!/bin/sh&lt;br /&gt;&lt;br /&gt;if [ ! -f commons-lang.jar ]; then&lt;br /&gt;s3cmd get [s3-location]/commons-lang.jar commons-lang.jar&lt;br /&gt;fi&lt;br /&gt;&lt;br /&gt;if [ ! -f google-collect.jar ]; then&lt;br /&gt;s3cmd get [s3-location]/google-collect.jar google-collect.jar&lt;br /&gt;fi&lt;br /&gt;&lt;br /&gt;if [ ! -f hadoop-core.jar ]; then&lt;br /&gt;s3cmd get [s3-location]/hadoop-core.jar hadoop-core.jar&lt;br /&gt;fi&lt;br /&gt;&lt;br /&gt;exec /opt/local/bin/scala -classpath commons-lang.jar:google-collect.jar:hadoop-core.jar "$0" "$@"&lt;br /&gt;&lt;br /&gt;!#&lt;br /&gt;(scala code here)&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;This method has some rather severe scaling problems as the complexity of the dependency graph increases.  
I was about to step into the endless cycle of testing my script, finding the missing or conflicting dependencies, and re-editing it to download and include the appropriate files.&lt;br /&gt;&lt;br /&gt;Fortunately, there was an easy solution.  We're already using &lt;a href=""&gt;Ivy&lt;/a&gt; to manage our dependencies in our compiled projects, and Ivy can be run in standalone mode outside of ant.  The key option to use is the "-cachepath" command line option, which causes Ivy to write a classpath of the cached dependencies to a specified file.  So, now the preamble of my scripts looks like this:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;#!/bin/bash&lt;br /&gt;&lt;br /&gt;tempfile=`mktemp /tmp/tfile.XXXXXXXXXX`&lt;br /&gt;&lt;br /&gt;/usr/bin/java -jar /mnt/bizo/ivy-script/ivy.jar -settings /mnt/bizo/ivy-script/ivyconf.xml -cachepath ${tempfile} &gt; /dev/null&lt;br /&gt;&lt;br /&gt;classpath=`cat ${tempfile} | tr -d "\n\r"`&lt;br /&gt;&lt;br /&gt;rm ${tempfile}&lt;br /&gt;&lt;br /&gt;exec /opt/local/bin/scala -classpath "${classpath}" "$0" "$@"&lt;br /&gt;&lt;br /&gt;!#&lt;br /&gt;(scala code here)&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Now all I need is a standard ivy.xml file living next to my script, and Ivy will automagically resolve all of my dependencies and insert them into the script's classpath for me.&lt;br /&gt;&lt;br /&gt;Crisis averted.  
Life is once again filled with joy and happiness.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-6864489159165677500?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/6864489159165677500/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=6864489159165677500' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/6864489159165677500'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/6864489159165677500'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2009/07/dependency-management-for-scala-scripts.html' title='Dependency management for Scala scripts using Ivy'/><author><name>Darren</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-8513541888780989360</id><published>2009-07-16T11:35:00.000-07:00</published><updated>2009-07-16T13:25:46.840-07:00</updated><title type='text'>Pruning EBS Snapshots</title><content type='html'>We've been using Amazon's Elastic Block Store (&lt;a href="http://aws.amazon.com/ebs/"&gt;EBS&lt;/a&gt;) for some time now.  In a nutshell, EBS is like a "hard drive for the AWS cloud".  You simply create an EBS volume and then mount it on your &lt;a href="http://aws.amazon.com/ec2/"&gt;EC2&lt;/a&gt; instance.  You then read/write to it as if it were local storage.  
For a good intro to EBS, check out this &lt;a href="http://www.rightscale.com/"&gt;RightScale&lt;/a&gt; blog &lt;a href="http://blog.rightscale.com/2008/08/20/amazon-ebs-explained/"&gt;post&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The snapshots feature of EBS is especially handy as it allows you to easily back up the data on your EBS volume.  AWS provides an API that allows you to request a snapshot.  The API call will return immediately and then, in the background, the backup will occur and eventually be uploaded to S3.&lt;br /&gt;&lt;br /&gt;While the snapshots feature is useful, one of the issues that you will likely run into is the snapshot limit.  A standard AWS account allows you to have 500 EBS snapshots at any given time.  After this limit has been reached, you will no longer be able to create new snapshots.  So, you will need to have a strategy to 'prune' (remove) snapshots.&lt;br /&gt;&lt;br /&gt;I wasn't able to find any scripts for pruning EBS snapshots on the web, so I ended up writing a little Ruby script to accomplish the task.&lt;br /&gt;&lt;br /&gt;You can get the script &lt;a href="http://github.com/timoteo/ebs_snapshot_pruning/blob/d0798b996f48d3c438d067e7bb2d3f65256999d9/prune_ebs_snapshots.rb"&gt;here&lt;/a&gt;.  
It requires the excellent &lt;a href="http://rightaws.rubyforge.org/"&gt;right_aws&lt;/a&gt; ruby gem.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-8513541888780989360?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/8513541888780989360/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=8513541888780989360' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/8513541888780989360'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/8513541888780989360'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2009/07/pruning-ebs-snapshots.html' title='Pruning EBS Snapshots'/><author><name>Timo</name><uri>http://www.blogger.com/profile/05949421779840031276</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-8305051437970079725</id><published>2009-07-14T17:20:00.001-07:00</published><updated>2009-07-14T17:21:30.138-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hive'/><title type='text'>custom map scripts and hive</title><content type='html'>First, I have to say that after using &lt;a href="http://wiki.apache.org/hadoop/Hive"&gt;Hive&lt;/a&gt; for the past couple of weeks and actually writing some real reporting tasks with it, it would be really hard to go back.  If you are writing straight hadoop jobs for any kind of report, please give hive a shot.  
You'll thank me.&lt;br /&gt;&lt;br /&gt;Sometimes, you need to perform data transformation in a more complex way than SQL will allow (even with &lt;a href="http://bizo-dev.blogspot.com/2009/06/custom-udfs-and-hive.html"&gt;custom UDFs&lt;/a&gt;).  Specifically, if you want to return a different number of columns, or a different number of rows for a given input row, then you need to perform what hive calls a &lt;a href="http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform"&gt;transform&lt;/a&gt;.  This is basically a custom streaming map task.&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;The basics&lt;/h2&gt;&lt;br /&gt;1. You are not writing an org.apache.hadoop.mapred.Mapper class!  This is just a simple script that reads rows from stdin (columns separated by \t) and should write rows to stdout (again, columns separated by \t).  It's worth repeating: you shouldn't be thinking in terms of keys and values; you need to think in terms of columns.&lt;br /&gt;&lt;br /&gt;2. You can write your script in any language you want, but it needs to be available on all machines in the cluster.  An easy way to do this is to take advantage of the hadoop distributed cache support, and just use add file /path/to/script within hive.  
The script will then be distributed and can be run as just ./script (assuming it is executable), or 'perl script.pl' if it's perl, etc.&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;An example&lt;/h2&gt;&lt;br /&gt;This is a simplified example, but recently I had a case where one of my columns contained a bunch of key/value pairs separated by commas:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;k1=v1,k2=v2,k3=v3,...&lt;br /&gt;k1=v1,k2=v2,k3=v3,...&lt;br /&gt;k1=v1,k2=v2,k3=v3,...&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;I wanted to transform these records into a 2 column table of k/v:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;k1	v1&lt;br /&gt;k2	v2&lt;br /&gt;k3	v3&lt;br /&gt;k1	v1&lt;br /&gt;k2	v2&lt;br /&gt;...&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;I wrote a simple perl script to handle the map, created the 2 column output table, then ran the following:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;-- add script to distributed cache&lt;br /&gt;add file /tmp/&lt;a href="http://com-bizo-public.s3.amazonaws.com/hive/mapper/split_kv.pl"&gt;split_kv.pl&lt;/a&gt;;&lt;br /&gt;&lt;br /&gt;-- run transform&lt;br /&gt;insert overwrite table test_kv_split&lt;br /&gt;select&lt;br /&gt;  transform (d.kvs)&lt;br /&gt;    using './split_kv.pl'&lt;br /&gt;    as (k, v)&lt;br /&gt;from&lt;br /&gt;  (select all_kvs as kvs from kv_input) d&lt;br /&gt;;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;As you can see, you can specify both the input and output columns as part of your transform statement.&lt;br /&gt;&lt;br /&gt;And... that's all there is to it.  Next time...  
a reducer?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-8305051437970079725?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/8305051437970079725/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=8305051437970079725' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/8305051437970079725'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/8305051437970079725'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2009/07/custom-map-scripts-and-hive.html' title='custom map scripts and hive'/><author><name>larry ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-5858974685804311578</id><published>2009-07-07T13:44:00.000-07:00</published><updated>2009-07-07T15:27:37.980-07:00</updated><title type='text'>Load testing with Tsung</title><content type='html'>One of the big issues with building scalable software is making tests scale along with the application.  A high performance web application should be tested under heavy loads, preferably to the breaking point.  Of course, now you need a second application that can generate lots of traffic.  
You could use something simple like httperf; however, this doesn't work so well with complex systems, since you're only hitting one URL at a time.&lt;br /&gt;&lt;br /&gt;Enter &lt;a href="http://tsung.erlang-projects.org/"&gt;Tsung&lt;/a&gt;.  Tsung is a load testing tool written in Erlang (everybody's favorite scalable language) that can not only generate large amounts of traffic, but also parametrize requests with data returned by your web application or pulled from external files.  It can also generate very nice HTML reports using gnuplot.&lt;br /&gt;&lt;br /&gt;Here's how we're running Tsung on Ubuntu in EC2:&lt;br /&gt;&lt;ol&gt;&lt;br /&gt;&lt;li&gt;Start a new instance.  We're using an Ubuntu Hardy instance built by Alestic.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Download, configure, compile, and install &lt;a href="http://erlang.org/"&gt;Erlang&lt;/a&gt;.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Get and install the Tsung dependencies: gnuplot and perl5.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Download, configure, compile, and install &lt;a href="http://tsung.erlang-projects.org/"&gt;Tsung&lt;/a&gt;.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Install your favorite web server.  I prefer Apache HTTPD...others in this office prefer nginx.  If you want to be really Erlang-y, install Yaws or Mochiweb.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Configure your ~/.tsung/tsung.xml configuration file for your test.  The &lt;a href="http://tsung.erlang-projects.org/user_manual.html"&gt;Tsung user manual&lt;/a&gt; has pretty good documentation about how to do this.  Note that you do NOT want to use vm-transport for heavy loads, as this prevents Erlang from spawning additional virtual machines, which limits the number of requests you can generate at a time.  Avoiding vm-transport does require you to set up passwordless ssh access to localhost.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Point your web server at "~/.tsung/log/".  
Each test you run will log the results in a subdirectory of this location.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Start your test with the "tsung start" command.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Set the report-generating script /usr/lib/tsung/bin/tsung_stats.pl to run in the appropriate log directory every 10 seconds.  You can do this via cron or by simply keeping a "watch" command running in the background.&lt;/li&gt;&lt;br /&gt;&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;Now, you can just browse over to your machine to view the latest test report.  Tsung exposes all of the statistics you would expect (req/sec, throughput, latency, etc.) in both numerical and graphical form.  All of the graphs can be downloaded as high-quality PostScript graphs, too.&lt;br /&gt;&lt;br /&gt;If you want to generate truly large amounts of traffic, Tsung supports distributed testing environments (as you might expect from an Erlang testing tool).  Just make sure that you have passwordless SSH set up between your test machines and configure the client list in your tsung.xml file appropriately.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-5858974685804311578?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/5858974685804311578/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=5858974685804311578' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/5858974685804311578'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/5858974685804311578'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2009/07/load-testing-with-tsung.html' title='Load testing with 
Tsung'/><author><name>Darren</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-7790006852025533499</id><published>2009-06-23T15:56:00.001-07:00</published><updated>2009-10-08T11:02:02.733-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='hive'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>custom UDFs and hive</title><content type='html'>We just started playing around with &lt;a href="http://wiki.apache.org/hadoop/Hive"&gt;Hive&lt;/a&gt;.  Basically, it lets you write your hadoop map/reduce jobs using a SQL-like language.  This is pretty powerful.  Hive also seems to be pretty extendable -- custom data/serialization formats, custom functions, etc.&lt;br /&gt;&lt;br /&gt;It turns out that writing your own UDF (user defined function) for use in hive is actually pretty simple.&lt;br /&gt;&lt;br /&gt;All you need to do is extend &lt;a href="http://hadoop.apache.org/hive/docs/current/api/org/apache/hadoop/hive/ql/exec/UDF.html"&gt;UDF&lt;/a&gt;, and write one or more evaluate methods with a hadoop Writable return type.  
Here's an example of a complete implementation for a lower case function:&lt;br /&gt;&lt;br /&gt;&lt;pre class="prettyprint" style="border: none;"&gt;package com.bizo.hive.udf;&lt;br /&gt;&lt;br /&gt;import org.apache.hadoop.hive.ql.exec.UDF;&lt;br /&gt;import org.apache.hadoop.io.Text;&lt;br /&gt;&lt;br /&gt;public final class Lower extends UDF {&lt;br /&gt;  public Text evaluate(final Text s) {&lt;br /&gt;    if (s == null) { return null; }&lt;br /&gt;    return new Text(s.toString().toLowerCase());&lt;br /&gt;  }&lt;br /&gt;}&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;(Note that there's already a built-in function for this; it's just an easy example.)&lt;br /&gt;&lt;br /&gt;As you've probably noticed from the import statements, you'll need to add build-time dependencies for hadoop and hive_exec.&lt;br /&gt;&lt;br /&gt;The next step is to add the jar with your UDF code to the hive classpath. The easiest way I've found to do this is to set &lt;tt&gt;HIVE_AUX_JARS_PATH&lt;/tt&gt; to a directory containing any jars you need to add before starting hive.  Alternatively you can edit &lt;tt&gt;$HIVE_HOME/conf/hive-site.xml&lt;/tt&gt; with a &lt;tt&gt;hive.aux.jars.path&lt;/tt&gt; property.  Either way you need to do this before starting hive.  
It looks like there's a patch out there to &lt;a href="https://issues.apache.org/jira/browse/HIVE-338"&gt;dynamically add/remove jars to the classpath&lt;/a&gt;, so, hopefully this will be easier soon.&lt;br /&gt;&lt;br /&gt;example:&lt;br /&gt;&lt;pre&gt;# directory containing any additional jars you want in the classpath&lt;br /&gt;export HIVE_AUX_JARS_PATH=/tmp/hive_aux&lt;br /&gt;&lt;br /&gt;# start hive normally&lt;br /&gt;/opt/hive/bin/hive&lt;/pre&gt;&lt;br /&gt;Once you have hive running, the last step is to register your function:&lt;br /&gt;&lt;pre&gt;create temporary function my_lower as 'com.bizo.hive.udf.Lower';&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Now, you can use it:&lt;br /&gt;&lt;pre&gt;hive&gt; select my_lower(title), sum(freq) from titles group by my_lower(title);&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;...&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;Ended Job = job_200906231019_0006&lt;br /&gt;OK&lt;br /&gt;cmo 13.0&lt;br /&gt;vp 7.0&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Although it's pretty simple, I didn't see this documented anywhere so I thought I would write it up.  
I also added it to the &lt;a href="http://wiki.apache.org/hadoop/Hive/AdminManual/Plugins"&gt;wiki&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-7790006852025533499?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/7790006852025533499/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=7790006852025533499' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7790006852025533499'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/7790006852025533499'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2009/06/custom-udfs-and-hive.html' title='custom UDFs and hive'/><author><name>larry ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-4732831405058317869</id><published>2009-06-11T17:30:00.000-07:00</published><updated>2009-06-11T17:40:36.346-07:00</updated><title type='text'>Force.com's SOAP/REST library for Google App Engine/Java</title><content type='html'>As long as I'm reflecting on our Google I/O experiences, I also want to point out what looks like a very useful library from Salesforce.  The &lt;a href="http://code.google.com/p/sfdc-wsc/"&gt;Force.com Web Services Connector&lt;/a&gt; is a toolkit designed to simplify calling WSDL-defined SOAP and REST services.  
The best part is that they have a version that works on Google App Engine for Java!  (Make sure that you use wsc-gae-16_0.jar, not the regular version.)  &lt;br /&gt;&lt;br /&gt;I haven't had the chance to do a lot of development on GAE/J, but my colleagues have definitely had some &lt;a href="http://bizo-dev.blogspot.com/2009/04/calling-soap-web-services-on-google-app.html"&gt; headaches&lt;/a&gt; getting SOAP and REST calls working around the GAE/J whitelist.  Maybe one of them can comment after we give this toolkit a whirl.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-4732831405058317869?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/4732831405058317869/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=4732831405058317869' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/4732831405058317869'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/4732831405058317869'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2009/06/forcecoms-soaprest-library-for-google.html' title='Force.com&apos;s SOAP/REST library for Google App Engine/Java'/><author><name>Darren</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-5903834214886537058</id><published>2009-06-11T11:03:00.000-07:00</published><updated>2009-06-11T17:27:39.051-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Google 
Visualizations'/><category scheme='http://www.blogger.com/atom/ns#' term='Google I/O'/><title type='text'>Google Visualizations Java Data Source Library</title><content type='html'>As with any data-oriented company, most of our projects revolve around collecting data, processing data, and exposing data to users.  In that third category, we've been moving towards &lt;a href="http://code.google.com/apis/visualization/"&gt;Google Visualizations&lt;/a&gt; to draw our pretty graphs and charts.  So, while the free Android phone and Google Wave were attracting a lot of attention at Google I/O, from a practical standpoint, I was actually most excited about Google's new &lt;a href="http://code.google.com/apis/visualization/documentation/dev/dsl_about.html"&gt;Data Source Java Library&lt;/a&gt;.  We had previously written something similar to this in-house, but we were still working on some of the optional parts of the specification when this library was released.&lt;br /&gt;&lt;br /&gt;In a nutshell, Google Visualizations is a Javascript library that draws charts and graphs.  The data is inserted in one of three ways: programmatically in Javascript, via a JSON object, or by pointing the Javascript at a Data Source URL.  For example, Google spreadsheets have built-in functionality to expose their contents as a Data Source, so you can just point the Javascript at a special URL, and a graph of your spreadsheet's data will pop up on your webpage.  If you use the last method, you can use Gadgets to easily create custom dashboards displaying your data.&lt;br /&gt;&lt;br /&gt;The Data Source Java Library makes it very easy to implement a Data Source backed by whatever internal data store you might be using -- it's just a matter of creating a DataTable object and populating it with data.  The library provides everything else, up to and including the servlet to drop into your web container.  (We ended up implementing a Spring controller instead.  
The library provides helper code for this; I estimate using a Spring controller instead of a servlet cost us four lines of code.)&lt;br /&gt;&lt;br /&gt;The best part is that it also implements a SQL-like query language for you, so you can expose your data in different forms (which are required by different visualizations) based on the parameters to the URL you call.  Dumping data into JSON objects is very straightforward.  Writing a parser and interpreter for queries is a real pain.&lt;br /&gt;&lt;br /&gt;The library lets you specify how much of the query language you want to implement and which parts you want to make the library worry about.  The only (small) complaint I have about this is that this configuration is rather coarsely defined -- we wanted to support basic column SELECTs (to improve performance on our backend) but have the library handle the aggregation functions (which our backend does not support).  It wasn't too tough working around this restriction, although it does cost us a bit of extra parsing (so we can get a copy of the complete query) and column filtering (because both our code and the library process the SELECT phrase).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-5903834214886537058?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/5903834214886537058/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=5903834214886537058' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/5903834214886537058'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/5903834214886537058'/><link rel='alternate' type='text/html' 
href='http://dev.bizo.com/2009/06/google-visualizations-java-data-source.html' title='Google Visualizations Java Data Source Library'/><author><name>Darren</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-4518600711618257691</id><published>2009-05-20T18:29:00.001-07:00</published><updated>2009-05-20T18:30:35.477-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='ec2'/><category scheme='http://www.blogger.com/atom/ns#' term='appengine'/><title type='text'>new version of s3-simple</title><content type='html'>Just committed some small changes to the &lt;a href="http://github.com/ogrodnek/s3-simple"&gt;s3-simple&lt;/a&gt; library for specifying ACLs, and/or arbitrary request headers/meta-data while storing keys.&lt;br /&gt;&lt;br /&gt;Example usage:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;S3Store s3 = new S3Store("s3.amazonaws.com", ACCESS_KEY, SECRET_KEY);&lt;br /&gt;s3.setBucket("my-bucket");&lt;br /&gt;&lt;br /&gt;// upload an item as public-read&lt;br /&gt;s3.storeItem("test", new String("hello").getBytes(), "public-read");&lt;br /&gt;&lt;br /&gt;// upload a js file, with a cache control-header&lt;br /&gt;final Map&amp;lt;String, List&amp;lt;String&amp;gt;&amp;gt; headers = new HashMap&amp;lt;String, List&amp;lt;String&amp;gt;&amp;gt;();&lt;br /&gt;headers.put("Cache-Control", Collections.singletonList("max-age=300, must-revalidate"));&lt;br /&gt;headers.put("Content-Type", Collections.singletonList("application/x-javascript"));&lt;br /&gt;&lt;br /&gt;s3.storeItem("test2.js", new String("document.write('hello');").getBytes(), "public-read", headers);&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;a 
href="http://cloud.github.com/downloads/ogrodnek/s3-simple/s3-shell-1.0.5.jar"&gt;Download it here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Currently, you can only do this while storing keys, and there's no way to retrieve this data later.  Still, it was enough of a pain to get this working correctly with the request signing, so I figured I'd share the code anyway.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5261056907132640554-4518600711618257691?l=dev.bizo.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dev.bizo.com/feeds/4518600711618257691/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5261056907132640554&amp;postID=4518600711618257691' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/4518600711618257691'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5261056907132640554/posts/default/4518600711618257691'/><link rel='alternate' type='text/html' href='http://dev.bizo.com/2009/05/new-version-of-s3-simple.html' title='new version of s3-simple'/><author><name>larry ogrodnek</name><uri>http://www.blogger.com/profile/01105034385285773975</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5261056907132640554.post-739079243464027639</id><published>2009-05-08T09:57:00.000-07:00</published><updated>2009-05-08T11:24:00.513-07:00</updated><title type='text'>Work @ Bizo</title><content type='html'>We’re looking for an out-of-the-box thinker with a good sense of humor and a great attitude to join our product development team. 
As one of five software engineers for &lt;a href="http://www.bizo.com"&gt;Bizo&lt;/a&gt;, you will take responsibility for developing key components of the Bizographic Targeting Platform, a revolutionary new way to target business advertising online. You will be a key player on an incredible team as we build our world-beating, game-changing, and massively-scalable bizographic advertising and targeting platform. &lt;u&gt;In a nutshell, you will be working on difficult problems &lt;a href="http://twitter.com/follow_bizo"&gt;with&lt;/a&gt; &lt;a href="http://twitter.com/floodfx"&gt;cool&lt;/a&gt; &lt;a href="http://twitter.com/ogrodnek"&gt;people&lt;/a&gt;.&lt;/u&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;&lt;br /&gt;The Team:&lt;/span&gt;&lt;br /&gt;We’re a small team of very talented people (if we don’t say so ourselves!). We use Agile development methodologies. We care about high quality results, not how many hours you’re in the office. We develop on &lt;a href="http://store.apple.com/us/browse/home/shop_mac/family/macbook_pro?mco=MTE3MDE"&gt;Mac&lt;/a&gt;, run on the &lt;a href="http://aws.amazon.com"&gt;Cloud&lt;/a&gt;, and use &lt;a href="http://www.google.com/apps/intl/en/business/index.html"&gt;Google Apps&lt;/a&gt;. We don’t write huge requirements documents or &lt;a href="http://en.wikipedia.org/wiki/TPS_report"&gt;TPS reports&lt;/a&gt;. 
&lt;a href="http://www.google.com/search?q='there+is+no+spoon'"&gt;We believe there is no spoon!&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;&lt;br /&gt;The Ideal Candidate:&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Self-motivated&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Entrepreneurial / Hacker spirit&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Track record of achievement&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Hands-on problem solver who enjoys cracking difficult nuts&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Enjoys working on teams&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Experience working with highly scalable systems&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Linux/Unix proficiency&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Amazon Web Services experience&lt;/li&gt; &lt;br /&gt;&lt;li&gt;Bachelor’s degree in Computer Science or related field – points for advanced degrees&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Willing to &lt;a href="http://bizo-dev.blogspot.com"&gt;blog&lt;/a&gt; about Bizo Dev :)&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;u&gt;Gets stuff done!&lt;/u&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Bonus points for MBS (Mad Bocce Skills)&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;&lt;br /&gt;Technical Highlights:&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Languages:  Java (90%), Javascript, Ruby (thinking about Scala, Clojure and Erlang too!)&lt;/li&gt;&lt;br /&gt;&lt;li&gt;All Clouds, all the time: Amazon Web Services, Google App Engine&lt;/li&gt;&lt;br /&gt;&lt;li&gt;  Frameworks/Libraries: &lt;a href="http://hadoop.apache.org/core/"&gt;Hadoop&lt;/a&gt;, &lt;a href="http://incubator.apache.org/thrift/"&gt;Thrift&lt;/a&gt;, &lt;a href="http://code.google.com/p/google-collections/"&gt;Google Collections&lt;/a&gt;, &lt;a href="http://springframework.org"&gt;Spring&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="mailto:donnie+dev_jobs@bizo.com"&gt;Send me&lt;/a&gt; your resume and a cover letter if you are 
interested in joining our team.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='ht
