Developer Blog

Java/Scala and Highly Scalable Systems on AWS

Beware Java Enums in Spark

23 February 2014

A few days back I wrote a Spark job that runs an A/B test to compare the conversion rates between two groups of website visitors on one of our client's websites.


Why we chose not to git fast forward merge

14 February 2014

This post comes from an interesting email thread we had. In the thread, Stephen was answering various questions from the team about why it is so important to NOT use fast-forward merges.


Crucible Survivor - a code review dashboard

14 February 2014

Crucible Survivor Dashboard


SCM Migration

26 August 2013

We happily used Atlassian’s hosted OnDemand service for source code management with the following setup


Using AWS Custom SSL Domain Names for CloudFront

20 June 2013

AWS recently announced the limited availability of Custom SSL Domain Names for CloudFront. You have to request an invitation in order to start using it, but I am guessing it won't be long until it has been rolled out to all customers.


Scala Command-Line Hacks

22 April 2013

Do you like command-line scripting and one-liners with Perl, Ruby and the like?


Efficiency & Scalability

19 April 2013

Software engineers know that distributed systems are often hard to scale and many can intuitively point to reasons why this is the case by bringing up points of contention, bottlenecks and latency-inducing operations. Indeed, there exists a plethora of reasons and explanations as to why most distributed systems are inherently hard to scale, from the CAP theorem to scarcity of certain resources, e.g., RAM, network bandwidth ...


Sensible Defaults for Apache HttpClient

15 April 2013

Before coming to Bizo, I wrote a web service client that retrieved daily XML reports over HTTP using the Apache DefaultHttpClient. Everything went fine until one day the connection simply hung forever. We found this odd because we had set the connection timeout. It turned out we also needed to set the socket timeout (HttpConnectionParams.SO_TIMEOUT). The default for both connection timeout (max time to wait for a connection) and socket timeout (max time to wait between consecutive data packets) is infinity. The server was accepting the connection but then not sending any data so our client hung forever without even reporting any errors. Rookie mistake, but everyone is a rookie at least once. Even if you are an expert with HttpClient, chances are there will be someone maintaining your code in the future who is not.
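
To make the difference between the two timeouts concrete, here is a minimal sketch that reproduces the failure mode using only the JDK's java.net.Socket rather than Apache HttpClient. It is an illustration under assumptions, not the code from the original client: the class name TimeoutDemo, the method probeSilentServer, and the timeout values are all hypothetical. A local server socket completes the TCP handshake but never sends a byte, so the connection timeout is never hit and only the socket (read) timeout gets the client unstuck.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class; the names here are illustrative, not from the original post.
class TimeoutDemo {

    /**
     * Connects to a local server that accepts the connection but never sends data,
     * then attempts a single read and reports what happened.
     */
    static String probeSilentServer(int connectTimeoutMs, int socketTimeoutMs) throws IOException {
        // Listening on an ephemeral loopback port is enough for the OS to complete
        // the TCP handshake; we never call accept() and never write anything.
        try (ServerSocket silentServer = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket()) {

            // Connection timeout: maximum time to wait for the connection to be established.
            client.connect(
                new InetSocketAddress(silentServer.getInetAddress(), silentServer.getLocalPort()),
                connectTimeoutMs);

            // Socket timeout: maximum time a read may block waiting for data.
            // The JDK default is 0, which means wait forever.
            client.setSoTimeout(socketTimeoutMs);

            InputStream in = client.getInputStream();
            try {
                int firstByte = in.read();
                return firstByte == -1 ? "connection closed" : "received data";
            } catch (SocketTimeoutException e) {
                // Without setSoTimeout above, this read would block indefinitely.
                return "read timed out";
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(probeSilentServer(1000, 300));
    }
}
```

In the HttpClient version described above, the corresponding knobs are the connection timeout and HttpConnectionParams.SO_TIMEOUT; the sketch only shows why setting the first without the second still leaves the client able to hang.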


Map-side aggregations in Apache Hive

18 February 2013

When running large scale Hive reports, one error we occasionally run into is the following:


Reader Driven Development

15 February 2013

In this talk on Effective ML, Yaron Minsky talks about Reader Driven Development: writing your code with the reader in mind, and making decisions that will make the code easier for other developers to read and understand down the line.